Re: Encoding problem with Greenstone

From John R. McPherson
Date Thu, 03 Apr 2003 12:31:20 +1200
Subject Re: Encoding problem with Greenstone
In-Reply-To (004d01c2f90e$e00b71e0$857c82c3-ionio-gr)
Emmanouil Magkos wrote:
> Hi list, here is my problem..
>
> The library system (Greenstone in a Win2k server with Apache) cannot
> browse the text of my *Greek-language* documents and give search
> results, when these are transformed into html format. On the contrary,
> there is no such problem with English text.. Here follow some changes I
> have tried to make to the config files (the full text are appended at
> the end of this mail).
>
> a1) Main.cfg
>
> Encoding shortname=windows-1253 "longname=Greek (Windows-1253)" map=win1253.ump
> (Removed #)
> and
> Encoding shortname=iso-8859-7 "longname=Greek (ISO-8859-7)" map=8859_7.ump
> (Removed #)
>
> a2) At the end of the main.cfg, I added the line
>
> cgiarg shortname=gr argdefault=iso-8859-7

These settings are used for display purposes, that is, the encoding in
which Greenstone sends data back to your web browser. Greenstone uses
Unicode internally, so these settings control the conversion from
Unicode to whatever encoding the user requests. They have no effect on
how the import documents are parsed, and only affect an already built
collection.

> HOWEVER, when the collection is being built, I see a message in the
> log file, saying that BASPlug is not able to find the correct encoding,
> and DEFAULTING to ENGLISH..

The detected language doesn't really matter unless you are explicitly
going to use the "Language" metadata. The encoding, however, is used to
convert the document to Unicode during the import process.
By the way, the "official" ISO 639 two-letter code for "Greek" is "el",
which Greenstone should be able to detect (if the import documents are
in iso_8859_7).
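
To illustrate why the import-time encoding matters, here is a minimal
Python sketch (not Greenstone's own code; the function name to_unicode
is made up for this example). It shows that the same bytes only turn
back into readable Greek when they are decoded with the encoding they
were actually written in:

```python
def to_unicode(raw_bytes, encoding):
    """Decode raw document bytes into a Unicode string.

    Hypothetical helper for illustration only; it mimics the idea of
    converting an import document to Unicode, nothing more.
    """
    return raw_bytes.decode(encoding)


greek = "Ελληνικά"                      # the word "Greek", in Greek
raw = greek.encode("iso8859_7")         # bytes as stored in an ISO-8859-7 file

right = to_unicode(raw, "iso8859_7")    # correct encoding: text is recovered
wrong = to_unicode(raw, "latin-1")      # wrong guess: accented Latin junk

print(right == greek)   # the correct encoding round-trips
print(wrong == greek)   # the wrong encoding does not
```

If the importer guesses the wrong encoding, the text stored in the
collection is already wrong, and no display setting can repair it
afterwards.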

> b) I even tried to change things in the collect.cfg file:
>
> ....
> plugin ZIPPlug
> plugin GAPlug
> plugin TEXTPlug
> plugin HTMLPlug -default_encoding -ISO_8859_7
> plugin EMAILPlug
> plugin PDFPlug
> plugin RTFPlug
> plugin WordPlug -default_encoding -ISO_8859_7
> plugin PSPlug
> plugin ArcPlug
> plugin RecPlug
>
> * I tried the same with INPUT_ENCODING, as well as with WINDOWS-1253, but with no Results :'((

The Word plugin always converts to UTF-8 (Unicode), so you probably
don't want to force Greenstone to treat it as iso_8859_7.

> Can you please suggest me what should I do?

You don't say which version of Greenstone you are using. Also, what
format are your input documents in? If they are PDF or MS Word, it is
possible that the third-party converter programs are failing in some
way. Try creating some test files in plain text or HTML format using
iso_8859_7; those should work. If they do, the problem is the
pdftohtml converter or the wvWare (MS Word) converter.
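
If it helps, one way to produce such test files is a short Python
script like the sketch below. The file names, the sample sentence, and
the function name write_test_files are only assumptions for the
example; any tool that can save text as ISO-8859-7 will do just as
well.

```python
from pathlib import Path

# Sample Greek text ("Good morning, world"); every character exists in ISO-8859-7.
GREEK_SAMPLE = "Καλημέρα κόσμε"


def write_test_files(out_dir):
    """Write one plain-text and one HTML test file encoded as ISO-8859-7.

    Hypothetical helper; returns the paths of the two files it created.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    txt_path = out / "test_greek.txt"
    html_path = out / "test_greek.html"

    # Plain text: just the sample sentence.
    txt_path.write_bytes((GREEK_SAMPLE + "\n").encode("iso8859_7"))

    # HTML: declare the charset in a meta tag so it matches the bytes on disk.
    html_text = (
        "<html><head>"
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-7">'
        "<title>Greek test</title></head>"
        f"<body><p>{GREEK_SAMPLE}</p></body></html>\n"
    )
    html_path.write_bytes(html_text.encode("iso8859_7"))

    return txt_path, html_path
```

Copy the two files into the collection's import directory, rebuild,
and check whether the Greek text can be browsed and searched.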


John McPherson.