Re: [greenstone-users] Hebrew characters

From John R. McPherson
Date Mon, 24 Jul 2006 16:04:26 +1200
Subject Re: [greenstone-users] Hebrew characters
In-Reply-To (002e01c6aed4$f1db51e0$6401a8c0-Galya)
On Sun, Jul 23, 2006 at 10:55:02PM -0500, Admin wrote:
> Hello everybody,
> We have a few English books with paragraphs in Hebrew. As you can see, they
> don't show up correctly
> http://lib.kabbalah.info/cgi-bin/library?e=d-000-00---0newtest--00-0-0--0prompt-10---4----dte--0-1l--1-en-50---20-about-%d7%a4--00031-001-1-0utfZz-8-00&a=d&c=newtest&cl=CL1.3.3&d=HASHb59144edba1cb57bf4a48d.5
>
> We are using UTF encoding in HTML files. All files are OK, if I test them
> from my local machine.
> Please, any advice how to fix this error.

It looks like the input documents have been re-encoded from iso-8859
to utf-8, probably because Greenstone tries to guess the encoding if
you don't specify it, and in this case it has guessed wrong.
There are two ways to fix this: a quick fix and a longer-term fix.

The quick fix is to specify the encoding in your collect.cfg
file, e.g.
plugin HTMLPlug -input_encoding utf8
for all your HTML source documents.
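
To illustrate why a wrong guess garbles the text, here is a small Python
sketch (not Greenstone code). It assumes the guess was ISO-8859-1, which
is only one member of the iso-8859 family, and the sample word is made up
for the example. Bytes that are already utf-8 get read as ISO-8859-1 and
then written out as utf-8 a second time:

```python
# Illustration only; this is not how Greenstone is implemented.
text = "שלום"                 # sample Hebrew word (assumption for the example)
raw = text.encode("utf-8")    # the bytes as stored in the source HTML file

# Wrong guess: treat the utf-8 bytes as ISO-8859-1, then re-encode as utf-8.
garbled = raw.decode("iso-8859-1").encode("utf-8")

# Correct setting: decode the bytes as utf-8, so the output is unchanged.
correct = raw.decode("utf-8").encode("utf-8")

print(len(raw), len(garbled))  # the garbled form is twice as long
print(correct == raw)

# The damage is reversible as long as no bytes were dropped along the way.
recovered = garbled.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(recovered == text)
```

Each Hebrew letter is two bytes in utf-8, and each of those bytes becomes
its own two-byte character after the wrong guess, which is why the page
shows pairs of accented Latin characters instead of Hebrew.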

The long-term fix is to improve our language detection - I suspect we
don't have any language models for Hebrew. If you can email me (off-list)
several documents in Hebrew then I can add model files (in perllib/textcat)
for it.

John McPherson