Re: [greenstone-users] encoding again

From John R. McPherson
Date Fri, 17 Mar 2006 07:30:44 +1300
Subject Re: [greenstone-users] encoding again
In-Reply-To (44196A09-2070209-gmx-net)
On Thu, Mar 16, 2006 at 02:37:13PM +0100, jens wille wrote:
> hi stefan!
> Stefan Boddie [16.03.2006 14:17]:
> > Richard's suggestion is maybe not as mad as it sounds. We've had
> > occasional problems with the XML Parser module in perl, where
> > older versions work ok but newer ones mess up the encoding (by
> > appearing to encode to UTF-8 twice, turning two-byte characters
> > into four-byte characters and so on).
> well, ok, might be ;-) i just was a bit irritated by such a
> suggestion - downgrading perl, hm...

If you are using the XML::Parser module that came with greenstone,
then you could try removing it and installing the version that comes
with your linux distribution.
The problem I've noticed with perl 5.8 is that it has much better unicode
support, but for backwards compatibility it assumes that input/output
streams are in iso-8859-1 (Latin-1) and converts them (!).
"use encoding ':utf8'" is supposed to fix that, but doesn't work
with perl 5.6.
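
To make the double encoding concrete, here is a minimal sketch of the effect. It is written in Python rather than perl, purely as an illustration, and it is not taken from greenstone or XML::Parser. It assumes the mistake is exactly the one described above: bytes that are already UTF-8 get treated as Latin-1 characters and are encoded a second time.

```python
# Illustration only: how treating UTF-8 bytes as Latin-1 and encoding them
# again produces "double-encoded" UTF-8 (two bytes become four).
text = "b\u00fcckeburg"          # "bückeburg" as a character string

once = text.encode("utf-8")      # correct: u-umlaut becomes the bytes C3 BC

# The mistake: read those bytes as Latin-1 characters, then encode again.
twice = once.decode("latin-1").encode("utf-8")

print(once)
print(twice)

# Reversing the two steps recovers the correctly encoded bytes.
repaired = twice.decode("utf-8").encode("latin-1")
assert repaired == once
```

The u-umlaut ends up as the four bytes C3 83 C2 BC instead of C3 BC, which is the "two-byte characters into four-byte characters" effect Stefan mentions.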

> in an earlier test i printed out the values of $value in
> doc::add_utf8_metadata() before and after unicode::ensure_utf8()
> which got me (e.g.):
> €ckeburg / bückeburg
> bückeburg / €203¼ckeburg

Right, well that sounds like a bug in the ensure_utf8() function then.
Which version of greenstone are you using? Perhaps you could email me
a sample input document that causes this.
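
For comparison, this is roughly the behaviour I would expect from such a function: input that is already valid UTF-8 is left alone, and only other input is converted. The sketch below is in Python for illustration. The name ensure_utf8_sketch is hypothetical and this is not greenstone's actual ensure_utf8() code; it also assumes that any non-UTF-8 input is Latin-1.

```python
def ensure_utf8_sketch(data: bytes) -> bytes:
    """Hypothetical helper, NOT greenstone's ensure_utf8().

    Returns the input unchanged if it already decodes as UTF-8;
    otherwise assumes it is Latin-1 and converts it to UTF-8.
    """
    try:
        data.decode("utf-8")      # already valid UTF-8: leave untouched
        return data
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin-1.
        return data.decode("latin-1").encode("utf-8")


latin1_input = b"b\xfcckeburg"        # Latin-1 bytes for "bückeburg"
utf8_input = b"b\xc3\xbcckeburg"      # UTF-8 bytes for the same word

# Both inputs end up as the same UTF-8 bytes, and a second call changes nothing.
converted = ensure_utf8_sketch(latin1_input)
unchanged = ensure_utf8_sketch(utf8_input)
again = ensure_utf8_sketch(converted)
```

A check like this is only a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 would be passed through unconverted.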