Re: [greenstone-users] problem with text and html

From John R. McPherson
Date Sun, 12 Oct 2003 17:09:35 +1300
Subject Re: [greenstone-users] problem with text and html
In-Reply-To (1532-192-168-2-17-1065789473-squirrel-drtc-isibang-ac-in)
On Fri, Oct 10, 2003 at 06:07:53PM +0530, Aditya Tripathi wrote:
> Hi,
> I am not getting a proper display for Hindi documents in text and html
> files, whereas doc and pdf files work fine and display properly...
> I have observed that in text and html files each Hindi character is broken
> into three nonsensical characters (I know this is because of UTF-8).
> I am using:
> 1. Linux for collection building
> 2. Accessing the collection from a Windows XP machine, which has the
> required fonts for Hindi.
> Why am I not getting the display?

This definitely sounds like an encoding issue. Greenstone's plugins
try to guess which encoding to use for files if you don't set it
explicitly, but they don't always get it right. We use a program
called "textcat" to guess the language and encoding, and Hindi/ISCII
is the only Indian encoding we can detect (other than Unicode/UTF-8).
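
To illustrate why a wrong guess would show up as "three nonsensical
characters": a Devanagari letter takes three bytes in UTF-8, so if those
bytes are read as a single-byte encoding, each letter turns into three
unrelated characters. The Python sketch below is only an illustration and is
not part of Greenstone; it assumes the misread happens as Latin-1, which may
not be the exact encoding involved in your case.

```python
# Illustration only: one Devanagari letter, encoded as UTF-8 and then
# misread as Latin-1 (an assumed stand-in for whatever single-byte
# encoding is actually being applied).
letter = "\u0939"  # DEVANAGARI LETTER HA

utf8_bytes = letter.encode("utf-8")      # three bytes: E0 A4 B9
misread = utf8_bytes.decode("latin-1")   # each byte becomes one character

print(len(utf8_bytes))   # number of bytes in the UTF-8 form
print(len(misread))      # number of characters seen after the misread

# The damage is reversible as long as the bytes themselves are intact.
recovered = misread.encode("latin-1").decode("utf-8")
```

The source files are not damaged in this situation; the bytes only need to
be interpreted with the right encoding.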

If your text and HTML files are encoded in UTF-8, try adding the input_encoding
flag to your plugins in the etc/collect.cfg file, e.g.:

HTMLPlug -input_encoding utf8 -default_language hi
TEXTPlug -input_encoding utf8 -default_language hi

Use "iscii_de" instead of utf8 if your input files are in ISCII Devanagari
encoding.

("hi" is the 2-letter code for Hindi).
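
If you are not sure which of the two encodings your files use, a rough
check is to see whether the bytes decode cleanly as UTF-8. The Python
sketch below is only a suggestion and is not a Greenstone tool; the
function names are made up for this example. It can rule UTF-8 in or out,
but it cannot confirm ISCII, and a pure-ASCII file will pass the check
whatever its intended encoding.

```python
import sys


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_looks_like_utf8(path: str) -> bool:
    """Read a file in binary mode and apply the same check."""
    with open(path, "rb") as handle:
        return looks_like_utf8(handle.read())


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_utf8.py file1.txt file2.html
    for name in sys.argv[1:]:
        verdict = "valid UTF-8" if file_looks_like_utf8(name) else "not UTF-8"
        print(f"{name}: {verdict}")
```

Files reported as "not UTF-8" are candidates for the iscii_de setting, but
they could equally be in some other 8-bit encoding, so treat the result as
a hint only.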

Hope this helps
John McPherson