Re: [greenstone-devel] A small idea for GS!

From John R. McPherson
Date Mon, 14 Mar 2005 11:06:00 +1300
Subject Re: [greenstone-devel] A small idea for GS!
In-Reply-To (20050311152323-58887-qmail-web54210-mail-yahoo-com)
On Fri, 2005-03-11 at 07:23 -0800, Leho@nq wrote:

> 4. GS couldn't display some text files(.txt) correctly. They were
> encoded in Unicode (in Vietnamese). I used Notepad (WindowsXP) and
> could read them, but when I added them to a collection, GS could not
> display them the way I expected. Although I knew if it was a Word
> file(.doc), (have the same text), GS could. It's quite hard to
> express, so I attached them with this mail. If you like to try, please
> test these files for me. If GS doesn't display wrong character
> (square, sigma,...) I'm wrong :)

There are several problems causing this. Firstly, Greenstone
was guessing the wrong input encoding for your documents. We use a cool
little package that guesses the language and encoding based on the byte
distribution in the input, but unfortunately we don't have a model for
Vietnamese in Unicode, so it guesses an iso-8859 encoding and
'converts' the text into incorrect byte sequences.
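
To illustrate the effect of a wrong guess, here is a minimal Python
sketch. It is not Greenstone code; the sample string and the choice of
iso-8859-1 as the wrongly guessed encoding are assumptions made purely
for the demonstration.

```python
# Sample Vietnamese text, written with escapes so the source stays ASCII.
text = "Ti\u1ebfng Vi\u1ec7t"

# What a Windows editor saving as "Unicode" typically writes:
# 16-bit little-endian code units (byte order mark omitted here).
raw = text.encode("utf-16-le")

# A wrong guess: treat every byte as one iso-8859-1 character.
# This never fails, it just yields one junk character per byte.
wrong = raw.decode("iso-8859-1")

# The right guess recovers the original text.
right = raw.decode("utf-16-le")

print(repr(wrong))
print(right)
```

Every byte is a valid iso-8859-1 character, so the wrong decoding
raises no error; it silently produces twice as many characters as the
original, which is where the squares and odd symbols come from.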

Normally, you could correct this by editing the collect.cfg file (or
using the GLI) to set -input_encoding to unicode and -default_language
to "vi", but it seems that Greenstone gets it wrong if your files are
encoded in "little endian" Unicode (which Microsoft programs use by
default) instead of the normal "big endian" Unicode.

We'll have a look at this and see if we can fix Greenstone both to
detect the encoding automatically and to work properly with
little-endian 16-bit Unicode.
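
For what it's worth, one common way to tell the two byte orders apart
is to look for a byte order mark (BOM) at the start of the file. The
Python sketch below shows the idea only; it is not the fix that will go
into Greenstone, the function names are made up for illustration, and
falling back to big endian when no BOM is present is an assumption.

```python
import codecs
from typing import Optional


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 byte order indicated by a leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode 16-bit Unicode, honouring a BOM when one is present."""
    encoding = sniff_utf16_bom(data)
    if encoding is None:
        # No BOM: fall back to the assumed default byte order.
        return data.decode(default)
    # Skip the two BOM bytes so they don't end up in the text.
    return data[2:].decode(encoding)


# A little-endian file with a BOM, as Windows programs usually write it.
sample = codecs.BOM_UTF16_LE + "Vi\u1ec7t".encode("utf-16-le")
print(sniff_utf16_bom(sample))
print(decode_utf16(sample))
```

Note that this only helps when the file actually starts with a BOM, and
a UTF-32 little-endian BOM begins with the same two bytes, so a real
detector would need to check for that case first.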

John McPherson