Re: [greenstone-devel] A small idea for GS!

From John R. McPherson
Date Mon, 14 Mar 2005 14:04:07 +1300
Subject Re: [greenstone-devel] A small idea for GS!
In-Reply-To (1110751560-9085-15-camel-puriri-cs-waikato-ac-nz)
On Mon, 2005-03-14 at 11:06 +1300, John R. McPherson wrote:
> On Fri, 2005-03-11 at 07:23 -0800, Leho@nq wrote:
>
>
> > 4. GS couldn't display some text files(.txt) correctly. They were
> > encoded in Unicode (in Vietnamese). I used Notepad (WindowsXP) and
> > could read them, but when I added them to a collection, GS could not
> > display them the way I expected. Although I knew if it was a Word
> > file(.doc), (have the same text), GS could. It's quite hard to
> > express, so I attached them to this mail. If you like to try, please
> > test these files for me. If GS doesn't display wrong character
> > (square, sigma,...) I'm wrong :)
>
> Hi,
> there are several problems causing this to happen. Firstly, greenstone
> was guessing the wrong input encoding for your documents. We use a cool
> little package that guesses the language and encoding based on the byte
> distribution in the input, but unfortunately we don't have a model for
> Vietnamese in unicode, so it guesses it is an iso-8859 encoding and
> 'converts' it into the incorrect byte sequences.
>
> Normally, you could correct this by editing the collect.cfg file (or
> using the GLI) to say that the -input_encoding is unicode, and the
> -default_language is "vi", but it seems that greenstone gets it wrong if
> your files are encoded in "little endian" unicode (which Microsoft
> programs do by default) instead of the normal "big endian" unicode.
>
> We'll have a look at this and see if we can fix greenstone to both
> automatically detect the encoding, and work properly with little endian
> 16bit unicode.

Ok, it looks like if you specify '-input_encoding utf8' in the
collect.cfg file then it will automatically detect big-endian unicode
or little-endian unicode. That seems a bit weird, so I've fixed it.
If you download http://www.greenstone.org/tmp/multiread.pm and copy that
over the current multiread.pm file in the <$GSDLHOME>/perllib directory,
then you should be able to just put
TEXTPlug -input_encoding unicode
and greenstone will automatically figure out which endian the unicode
file is using.
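
For anyone curious how this kind of detection can work in general: one
common approach is to look at the byte-order mark (BOM) that many
programs write at the start of a 16-bit unicode file. The sketch below
is only an illustration in Python and is not the code in multiread.pm;
the function names are made up for this example, and falling back to
big-endian when there is no BOM is an assumption, not necessarily what
greenstone does.

```python
import codecs


def detect_utf16_endianness(data: bytes) -> str:
    """Return 'little', 'big', or 'unknown' based on a leading BOM.

    Hypothetical helper for illustration only. Note that a UTF-32
    little-endian BOM also begins with the same two bytes, which this
    simple check does not try to distinguish.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    return "unknown"


def decode_utf16(data: bytes) -> str:
    """Decode 16-bit unicode bytes, using the BOM to pick the byte order."""
    endian = detect_utf16_endianness(data)
    if endian == "little":
        # Skip the 2-byte BOM, then decode as little-endian.
        return data[2:].decode("utf-16-le")
    if endian == "big":
        # Skip the 2-byte BOM, then decode as big-endian.
        return data[2:].decode("utf-16-be")
    # Assumption: with no BOM present, treat the data as big-endian.
    return data.decode("utf-16-be")
```

A file saved as "Unicode" from Notepad would normally start with the
little-endian mark, so detect_utf16_endianness() would report "little"
for it.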

In addition, if you also download a new version of
http://www.greenstone.org/tmp/BasPlug.pm and overwrite the existing
BasPlug.pm file in <$GSDLHOME>/perllib/plugins, then it will
automatically detect unicode files without having to use the
-input_encoding option.

John

ps - I couldn't index the PDF file, because the person who made it
locked it to prevent text extraction from it.