Re: Interface in Indian languages

From John R. McPherson
Date Fri, 12 Jul 2002 16:07:40 +1200
Subject Re: Interface in Indian languages
In-Reply-To (Pine-LNX-4-21-0207111245170-8715-100000-ncsi-iisc-ernet-in)
"B.S Shivaram" wrote:


> I have successfully created the GSDL interface in these languages by
> creating required .dm files using Office XP and appropriate Unicode fonts
> for these two languages ("Mangal" font for Hindi language and
> "Tunga" font for Kannada - both these are available in Windows XP). Thus,
> I do not have any problems in creating GSDL interface for these languages.

Great! If you email them to greenstone@cs.waikato.ac.nz (or to me
as a last resort) we can include them in the base greenstone distribution.

> I created two collections - one containing Word DOC files created using
> Word/XP, and another containing the same files, saved as HTML files from
> within Word/XP. Content was entered using the fonts mentioned above. (We
> note that Windows uses Windows-1252 charset).


> However, the HTML files give problem in display - junk is
> displayed. Interestingly, I can copy portion of this junk text into the
> search box - GSDL searches this correctly!

My guess is that either the HTML plugin guessed the wrong encoding for
the files, or they are in an encoding that greenstone doesn't know
about.
After running setup.bat or setup.bash, you can type:
perl -S pluginfo.pl BasPlug
to get lots of information, including the available input encodings.
(It might scroll off your screen: pipe the output through the "more"
command to see one page at a time.)

You can force HTMLPlug to use a certain encoding by giving it an
argument in the collect.cfg file. Eg:
plugin HTMLPlug -input_encoding iso_8859_1

Greenstone has support for Devanagari and similar - try:
plugin HTMLPlug -input_encoding iscii_de

If you are using an encoding not supported by greenstone, you could
save the files in either Unicode or UTF-8.
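
If you are not sure what encoding Word actually wrote, a quick check outside
greenstone can help before choosing an -input_encoding value. The following is
only a rough Python sketch, not part of greenstone: the function names
sniff_encoding and reencode_to_utf8 are made up for illustration, and the check
only looks for a byte-order mark and for valid UTF-8, so anything else is
reported as "unknown".

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Best-effort guess at the encoding of raw file bytes (hypothetical helper)."""
    # A byte-order mark is the most reliable hint, so check for one first.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Without a BOM, see whether the bytes happen to be valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be Windows-1252, ISCII, or anything else; we cannot tell here.
        return "unknown"


def reencode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    sample = "हिन्दी".encode("utf-8")
    print(sniff_encoding(sample))
```

To check a real file, read it in binary mode (open(path, "rb").read()) and pass
the bytes to sniff_encoding. Note that re-encoding an HTML file does not touch
any charset declaration inside it, so that would need updating separately.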

Hope this helps,
John McPherson