[Fwd: Re: [greenstone-users] Hindi collection]

From ak19@cs.waikato.ac.nz
Date Fri Mar 14 15:56:08 2008
Subject [Fwd: Re: [greenstone-users] Hindi collection]
(Attachments removed, since the greenstone-users mailing list bounced the
message back otherwise.)

---------------------------- Original Message ----------------------------
Subject: Re: [greenstone-users] Hindi collection
From: ak19@cs.waikato.ac.nz
Date: Fri, March 14, 2008 2:41 pm
To: "Debjani Saha" <sahadebjani@yahoo.com>
Cc: greenstone-users@list.scms.waikato.ac.nz

Hi Debjani,

Thanks for the documents. I had to change a few things about them:
- zzz.html was not recognised and would not open in a text editor.
- I had to install a Bengali font. I did not install any Hindi ones
because I assumed that if searching works on English and Bengali
characters, it will work on Hindi too. Besides, since Hindi's Devanagari
script is closely related to the Bengali script, I presumed it would come
down to the same thing.
- Some of the files needed to be renamed because, in spite of my having
installed a Bengali font, I was unable to view the filenames in Bengali.
- I changed some of the text contents of the documents as well. Usually
this merely involved adding some random English words. I am attaching
your slightly modified documents, so try building a collection with these.

(Solution suggested by Dr Bainbridge.)
A very important change I made to your html files is to specify that the
encoding is UTF-8, so that Greenstone knows how to deal with any
multi-lingual content in the html.
I did this by adding the following meta tag inside the head tag. For
instance, the file yy.html now contains:

<title>Test and শসা</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

You need to add this tag exactly as given above to each of your
multi-lingual html files. (Alternatively, you could add this tag uniformly
to ALL your html documents, even if they're pure English.) This step will
make the Bengali contents of your multi-lingual html files searchable as
well. With the tag in place, searching works in my case.
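
If you have many html files, adding the tag by hand gets tedious. Here is
a minimal Python sketch of how it could be automated. The function name
add_utf8_meta is made up for illustration, and the sketch assumes that
each file has an opening head tag and is already saved as UTF-8. It only
inserts the declaration; it does not convert a file from another encoding.

```python
import re

# The meta tag exactly as given in the message above.
META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'


def add_utf8_meta(html_text):
    """Return html_text with META_TAG inserted right after the opening <head> tag.

    The text is returned unchanged if a meta tag already declares a charset
    or if no opening <head> tag is found.
    """
    # Leave the document alone if some meta tag already declares a charset.
    if re.search(r"<meta[^>]*charset", html_text, flags=re.IGNORECASE):
        return html_text

    # Match <head> or <head ...attributes...>, but not <header>.
    head_pattern = re.compile(r"(<head(?:\s[^>]*)?>)", flags=re.IGNORECASE)

    # Insert the tag on a new line after the first opening head tag only.
    return head_pattern.sub(
        lambda match: match.group(1) + "\n" + META_TAG,
        html_text,
        count=1,
    )
```

To use it on a file, read the file as text with encoding="utf-8", pass the
text through add_utf8_meta, and write the result back with the same
encoding. Keep a backup of your originals before trying this.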

Another rather minor change I made was to add some text in the first line
of 1.doc (I added some English words, but it might as well have been
Bengali). Otherwise 1.doc does not show up in my browse classifier. I
think this is because MS Word figures out the "title" of a document from
the first line or so.

Try the following:
(1) Download the attached files 1.doc, 2.doc, 3.doc, xx.html and yy.html
(2) Have a look in them so you understand what each contains and know what
text strings to search for and which documents you can expect to be
returned for a search.
(3) Create a new collection with these 5 files and build it
(4) View this new collection in the web browser. First you may want to
choose to view the files by browsing categories/classifiers (the Titles
and Filenames options next to search) just to make sure they all show up
in both titles and filenames browse categories.
(5) Perform some searches on some of the Bengali or English strings that
you know are in one or more of your documents. You may even want to search
for a combination of Bengali and English words, such as what's in the html
title tag of yy.html. I don't know if it shows up here, but I am referring
to: Test and শসা
For example, try performing a search for those three words in the *text*
fields of your documents as well as in their *title* fields. In the first
case, I get back the three files 3.doc, yy.html and xx.html. When I search
for the string in the title field, I get back just yy.html.
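
In case the Bengali word in that title reaches you as numeric character
references (&#2486;&#2488;&#2494;) rather than as Bengali letters, here is
a small Python sketch that turns them back into the actual characters, so
you can copy the string into the search box. It uses only the standard
html module; the variable names are just for illustration.

```python
import html

# The title string as it may appear when the mail software has replaced
# the Bengali letters with numeric character references.
encoded_title = "Test and &#2486;&#2488;&#2494;"

# html.unescape converts each &#NNNN; reference into its Unicode character.
search_string = html.unescape(encoded_title)

# Show the decoded string and the code point of each Bengali character.
print(search_string)
print([hex(ord(ch)) for ch in search_string.split()[-1]])
```

The three references decode to the code points U+09B6, U+09B8 and U+09BE,
which all lie in the Unicode Bengali block.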

Tell me how you get on. And remember that your html documents need the
meta tag that I mentioned earlier so that Greenstone can successfully
index their contents for searching.


> Thanking you for your cooperation. I am attaching herewith sample of my
> documents.
> Regards,