Re: [greenstone-users] encoding again

From John R. McPherson
Date Fri, 17 Mar 2006 09:56:13 +1300
Subject Re: [greenstone-users] encoding again
In-Reply-To (4419B27C-9020507-gmx-net)
On Thu, Mar 16, 2006 at 07:46:20PM +0100, jens wille wrote:
> hi john!
>
> John R. McPherson [16.03.2006 19:30]:
> > If you are using the XML::Parser module that came with
> > greenstone, then you could try removing it and installing the
> > version that comes with your linux distribution.
> no, i already (re-)moved greenstone's to use my site's modules.
>
> > The problem I've noticed with perl5.8 seems to be that perl 5.8
> > has much better unicode support, but for backwards compatibility
> > assumes that input/output streams are in iso-8859-1/Latin and
> > converts it (!). "use encoding ':utf8'" is supposed to fix that,
> > but doesn't work with perl5.6.
> i could try that, thanks!
>
> > Right, well that sounds like a bug in the ensure_utf8() function
> > then. Which version of greenstone are you using? Perhaps you
> > could email me a sample input document that causes this.
>
> i'm using v2.62, but also tested with v2.63. sample document attached.

Ok, it looks like greenstone is guessing the wrong encoding for the file
when it reads it. In this case it guesses 'windows_1252', so it treats
the data as cp1252 and converts it from cp1252 to utf-8.
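
As an illustration of what a wrong guess does, here is a minimal sketch.
It is written in Python rather than Greenstone's Perl, it assumes the file
really is UTF-8 (not confirmed in this thread), and the sample word and the
helper name to_utf8 are made up for the example:

```python
def to_utf8(raw: bytes, guessed_encoding: str) -> bytes:
    """Decode raw bytes using the guessed encoding, then re-encode as UTF-8."""
    return raw.decode(guessed_encoding).encode("utf-8")


# Hypothetical sample: "Kaese" with a-umlaut, stored on disk as UTF-8.
original = "K\u00e4se"
raw = original.encode("utf-8")

# Wrong guess: each byte of the two-byte UTF-8 sequence is read as a
# separate cp1252 character and then encoded again (double encoding).
wrong = to_utf8(raw, "cp1252")

# Correct (forced) encoding: the bytes come through unchanged.
right = to_utf8(raw, "utf-8")

print(wrong)
print(right)
```

With the wrong guess the a-umlaut ends up as four bytes instead of two,
which is what shows up as garbage characters in the built collection.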

You can force an encoding for all files for a plugin with the
'-input_encoding' plugin option. There might also be something you could
put in your html source document, although I'm not sure if HTMLPlug
uses the encoding specified in the document. If it doesn't, then it should
be fixed to use a declared encoding :)
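
For example, forcing the encoding for HTMLPlug might look like the line
below in the collection's collect.cfg. This is only a sketch: the value
'utf8' and the assumption that the source files really are UTF-8 are not
taken from this thread, so check the plugin's option list (for instance
with 'pluginfo.pl HTMLPlug') for the names your version accepts.

```
plugin HTMLPlug -input_encoding utf8
```

The collection needs to be re-imported and rebuilt for the change to take
effect.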

It's a bit difficult to find out what encoding greenstone guesses - I'll
add something so that it prints it out when the '-verbosity 3' argument
is used.

John