Re: [greenstone-users] encoding again

From jens wille
Date Thu, 16 Mar 2006 14:37:13 +0100
Subject Re: [greenstone-users] encoding again
In-Reply-To (44196574-5080703-dlconsulting-co-nz)
hi stefan!

Stefan Boddie [16.03.2006 14:17]:
> Richard's suggestion is maybe not as mad as it sounds. We've had
> occasional problems with the XML Parser module in perl, where
> older versions work ok but newer ones mess up the encoding (by
> appearing to encode to UTF-8 twice, turning two-byte characters
> into four-byte characters and so on).
well, ok, might be ;-) i was just a bit irritated by such a
suggestion - downgrading perl, hm...
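
for reference, a minimal sketch of what "encoding to UTF-8 twice" does to the bytes. it is python and purely illustrative, not greenstone or XML Parser code; the variable names are made up, and the assumption is that the second pass misreads the UTF-8 bytes as latin-1 characters:

```python
# illustration only (assumption: the second pass treats UTF-8 bytes as Latin-1)
original = "bückeburg"

# correct: one encoding pass, "ü" becomes the two bytes c3 bc
once = original.encode("utf-8")

# the bug: the already-encoded bytes are read as Latin-1 characters
# and encoded again, so "ü" ends up as the four bytes c3 83 c2 bc
twice = once.decode("latin-1").encode("utf-8")

print(len(once), once.hex(" "))
print(len(twice), twice.hex(" "))
```

each two-byte character grows to four bytes, which matches the symptom described above.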

> If the encoding of your archive files is correct, but the text
> coming out of those archives at build time is messed up, then I'd
> suspect the XML Parser. If that's the case it shouldn't make any
> difference if you build with mgpp or mg, you'll have the same
> problem.
in an earlier test i printed the value of $value in
doc::add_utf8_metadata() before and after the call to
unicode::ensure_utf8(), which gave me (e.g.):

b?ckeburg / bückeburg
bückeburg / b?203¼ckeburg

from which i assumed that it was encoded twice - however, i don't
know where and why this happens yet. (at least this seems to imply
that it's not a perl issue, right?)
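
to narrow down where the second pass happens, something like the following sketch might help. it is python, not greenstone's perl: `ensure_utf8` here is a hypothetical stand-in and says nothing about how unicode::ensure_utf8() is actually implemented, and `undo_double_encoding` is a made-up diagnostic helper. both assume the non-UTF-8 input is latin-1:

```python
def ensure_utf8(data: bytes) -> bytes:
    """Hypothetical guard: leave valid UTF-8 alone, otherwise treat as Latin-1."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8, do not encode again
    except UnicodeDecodeError:
        return data.decode("latin-1").encode("utf-8")


def undo_double_encoding(data: bytes) -> bytes:
    """Heuristic: reverse one Latin-1 -> UTF-8 pass if the result is still valid UTF-8."""
    try:
        candidate = data.decode("utf-8").encode("latin-1")
        candidate.decode("utf-8")  # raises if the input was not double-encoded
        return candidate
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data  # looks singly encoded (or not UTF-8 at all): leave untouched


latin1 = b"b\xfcckeburg"                  # raw Latin-1 input
utf8 = ensure_utf8(latin1)                # encoded once
doubled = b"b\xc3\x83\xc2\xbcckeburg"     # the double-encoded form
print(utf8, ensure_utf8(utf8), undo_double_encoding(doubled))
```

the point of the guard is that it is idempotent: running it on text that is already UTF-8 changes nothing. the undo helper is only a heuristic and can misfire on text that legitimately contains sequences like "Ã¼".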