| From | John R. McPherson |
| Date | Mon, 14 Mar 2005 14:04:07 +1300 |
| Subject | Re: [greenstone-devel] A small idea for GS! |
| In-Reply-To | (1110751560-9085-15-camel-puriri-cs-waikato-ac-nz) |
On Mon, 2005-03-14 at 11:06 +1300, John R. McPherson wrote:
> On Fri, 2005-03-11 at 07:23 -0800, Leho@nq wrote:
> >
> > 4. GS couldn't display some text files(.txt) correctly. They were
> > encoded in Unicode (in Vietnamese). I used Notepad (WindowsXP) and
> > could read them, but when I added them to a collection, GS could not
> > display them the way I expected. Although I knew if it was a Word
> > file(.doc), (have the same text), GS could. It's quite hard to
> > express, so I attacked them with this mail. If you like to try, please
> > test these files for me. If GS doesn't display wrong character
> > (square, sigma,...) I'm wrong :)
>
> Hi,
> there are several problems causing this to happen. Firstly, greenstone
> was guessing the wrong input encoding for your documents. We use a cool
> little package that guesses the language and encoding based on the byte
> distribution in the input, but unfortunately we don't have a model for
> Vietnamese in unicode, so it guesses it is an iso-8859 encoding and
> 'converts' it into the incorrect byte sequences.
>
> Normally, you could correct this by editing the collect.cfg file (or
> using the GLI) to say that the -input_encoding is unicode, and the
> -default_language is "vi", but it seems that greenstone gets it wrong if
> your files are encoded in "little endian" unicode (which Microsoft
> programs do by default) instead of the normal "big endian" unicode.
>
> We'll have a look at this and see if we can fix greenstone to both
> automatically detect the encoding, and work properly with little endian
> 16bit unicode.

Ok, it looks like if you specify '-input_encoding utf8' in the
In addition, if you also download a new version of
John

ps - I couldn't index the PDF file, because the person who made it
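
For anyone who wants to try re-encoding their files before importing them, here is a rough sketch in Python of the big endian / little endian distinction discussed above. It is not part of greenstone and is only an illustration: the function names `sniff_utf16_byte_order` and `to_utf8` are made up for this example, and it assumes the file starts with a byte order mark (which Notepad normally writes when saving as "Unicode").

```python
import codecs


def sniff_utf16_byte_order(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None, based on a leading byte order mark."""
    # Note: a UTF-32 little endian BOM also begins with FF FE; this sketch
    # ignores that case and only considers 16-bit unicode.
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None


def to_utf8(data: bytes) -> bytes:
    """Re-encode BOM-prefixed 16-bit unicode as UTF-8 (hypothetical helper)."""
    encoding = sniff_utf16_byte_order(data)
    if encoding is None:
        raise ValueError("no UTF-16 byte order mark found")
    # Skip the 2-byte BOM, decode with the detected byte order, write UTF-8.
    return data[2:].decode(encoding).encode("utf-8")
```

To convert a file, read it in binary mode, pass the bytes to `to_utf8`, and write the result to a new file in binary mode. Files without a byte order mark are rejected rather than guessed at.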