Re: [greenstone-users] encoding again

From Richard Managh
Date Thu, 16 Mar 2006 15:44:58 +1300
Subject Re: [greenstone-users] encoding again
In-Reply-To (4418794A-8080607-gmx-net)
Hi Jens,

I'm not aware of any way of directly looking at what's in the mg(pp) indexes.

Some things to try:

o Perhaps the problem lies in matching the text you enter against the index when you submit search queries. In every search you test, are your input characters in the encoding that Greenstone expects (the "w" argument)?

o Some versions of Perl get confused and double-encode UTF-8 when the XML parser parses your archives directory during the build phase. If you are running Perl 5.8, try 5.6.
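
To make the double-encoding case concrete, here is a minimal sketch in Python (not Perl, and not part of Greenstone). It assumes the bug takes the usual form: UTF-8 bytes are misread as Latin-1 characters and then encoded to UTF-8 a second time. The function names are made up for illustration, and the check is only a heuristic, since a text that legitimately contains sequences like "Ã¼" would also trigger it.

```python
def double_encode(text: str) -> bytes:
    """Simulate the suspected bug: UTF-8 bytes misread as Latin-1, then re-encoded."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def looks_double_encoded(data: bytes) -> bool:
    """Heuristic: undoing one UTF-8 layer still leaves valid, non-ASCII UTF-8."""
    if all(b < 0x80 for b in data):
        return False  # pure ASCII is identical under every encoding involved
    try:
        data.decode("utf-8").encode("latin-1").decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return True


def repair(data: bytes) -> bytes:
    """Strip one layer of UTF-8 encoding (only call this after the check above)."""
    return data.decode("utf-8").encode("latin-1")


if __name__ == "__main__":
    good = "für".encode("utf-8")
    bad = double_encode("für")
    print(good, looks_double_encoded(good))
    print(bad, looks_double_encoded(bad))
    print(repair(bad) == good)
```

Running a check like this over a term copied from the archives files, and over the same term as it comes back from a query, may help narrow down the stage at which the second encoding happens.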

DL Consulting
Greenstone Digital Library and Digitisation Specialists

jens wille wrote:
hi there!

it's time for me to ask for help with some encoding problems again ;-)

i'm building a collection using mgpp (v2.62, same with v2.63), the
source files are in utf8 and are processed with HTMLPlug. however,
i'm unable to search for terms containing umlauts :-( (the archives
files are correct utf8, so what goes wrong here has to be during
build phase)
a bit of examination led me to the assumption that my metadata are
"decoded" (where? to what encoding? why?) and then encoded to utf8
_twice_ (!) - oddly enough, this used to work with mg (though i
couldn't find any difference between mg and mgpp in this regard).

now i wanted to break my umlauts (ä => ae, ...) which i'm doing for
other diacritics (□ => c, ...) all along (using the filter_text
function; and which worked and still does - apparently!), but no
change for the umlauts: still no results.
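
(for illustration, the kind of folding described above could look like the sketch below. it is written in python, not as the actual filter_text code from the collection, and both the mapping table and the function name are assumptions. it only does the right thing on correctly decoded text; if the input is double-encoded, the umlaut characters are never seen and nothing gets replaced.)

```python
import unicodedata

# Hypothetical mapping for German umlauts; other diacritics are simply stripped.
UMLAUTS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def fold_diacritics(text: str) -> str:
    """Expand umlauts to two letters, then drop any remaining combining marks."""
    # Compose first so that "u" + combining diaeresis matches the table as "ü".
    text = unicodedata.normalize("NFC", text)
    for char, replacement in UMLAUTS.items():
        text = text.replace(char, replacement)
    # Decompose and remove combining marks (e.g. the cedilla of "ç").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_diacritics("Müller façade"))
```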

my question now is (apart from general help regarding this problem)
how i could have a look into the mg(pp)-index, to see what mg(pp)
actually has in there. there's db2txt for the text db, but this
doesn't seem to work for the index db. (besides, the Queryer shows
the same behaviour and isn't of much help here - at least not that i
know of)

again, any help would be greatly appreciated ;-)


