Re: [greenstone-users] encoding again

From Stefan Boddie
Date Fri, 17 Mar 2006 02:17:40 +1300
Subject Re: [greenstone-users] encoding again
In-Reply-To (44196089-5040103-gmx-net)
Hi Jens,

Richard's suggestion is maybe not as mad as it sounds. We've had occasional problems with the XML Parser module in Perl, where older versions work OK but newer ones mess up the encoding (by appearing to encode to UTF-8 twice, turning two-byte characters into four-byte characters and so on). If the encoding of your archive files is correct, but the text coming out of those archives at build time is messed up, then I'd suspect the XML Parser. If that's the case, it shouldn't make any difference whether you build with mgpp or mg; you'll have the same problem either way.
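
To make the double-encoding symptom concrete, here is a minimal Python sketch (not Greenstone or Perl code; the function names double_encode and undo_double_encoding are made up for illustration). It assumes the second pass misreads the UTF-8 bytes as Latin-1, which is one common way this kind of bug arises, and shows a two-byte character turning into four bytes.

```python
def double_encode(text: str) -> bytes:
    """Simulate the bug: UTF-8 bytes are misread as Latin-1, then encoded to UTF-8 again."""
    utf8_once = text.encode("utf-8")        # correct encoding
    misread = utf8_once.decode("latin-1")   # each byte wrongly treated as one character
    return misread.encode("utf-8")          # second, unwanted encoding pass


def undo_double_encoding(data: bytes) -> str:
    """Reverse one round of double encoding.

    Raises UnicodeError if the data was not double-encoded in the assumed way.
    """
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


original = "ü"
once = original.encode("utf-8")
twice = double_encode(original)

print(len(once), once.hex())    # byte count and bytes after one encoding
print(len(twice), twice.hex())  # byte count and bytes after the double encoding
print(undo_double_encoding(twice) == original)
```

If undo_double_encoding succeeds on text taken from a build and yields the expected characters, that would be consistent with the double-encoding theory; an exception suggests the text was damaged in some other way.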

I'd suggest rebuilding your collection with mg, just to see if it works. If it does, then the problem is in mgpp, which seems unlikely but is possible. If it doesn't, then it's likely to be Perl, or more correctly Perl's XML Parser module.


jens wille wrote:
> hi richard!
>
> Richard Managh [16.03.2006 03:44]:
>> I'm not aware of any way of directly looking at what's in the
>> mg(pp) indexes.
> too bad :-(
>
>> o Perhaps it's a problem matching your inputted text with that in
>> the index when you submit search queries. In all cases when you
>> test searching are your input characters the same encoding as
>> what greenstone expects? (the "w" argument)
> no, that's not the problem: first, the "w" argument is correct,
> second, what i was trying to do now is to replace all umlauts, so
> there are no special characters in the input.
>
>> o Some versions of Perl sometimes get confused and double encode
>> UTF-8 when the xml parser parses your archives directory during
>> the build phase. If you are running perl 5.8, try 5.6.
> *lol* sorry, but that's a bit odd a suggestion, isn't it ;-) rather
> i'd like to learn where this happens and how i can avoid it.
> (btw: why (and in what respect) does mgpp behave here differently
> than mg?).
>
> but maybe this really is where the problem originates, so i will try
> to elaborate on that (trace relevant subroutines, print out some
> variables, ... - it's just pretty time-consuming, so i wanted to ask
> here first).
>
> thanks for your suggestions, anyway!

