[greenstone-users] search and index Chinese characters

From K M Ku
Date Mon Mar 17 12:35:07 2008
Subject [greenstone-users] search and index Chinese characters
In-Reply-To (41520-130-217-240-32-1205378755-squirrel-webmail-scms-waikato-ac-nz)
No, it's not working.

The encoding is utf8, and I selected 'utf8' for TextPlug.
Can I send you the data for your testing?

-----Original Message-----
From: Anna Huang [mailto:lh92@cs.waikato.ac.nz]
Sent: Thursday, March 13, 2008 11:26 AM
To: K M Ku
Cc: greenstone-users@list.scms.waikato.ac.nz
Subject: RE: [greenstone-users] search and index Chinese characters

Hi Mr Ku,

Have you solved this problem? I hope so. Anyway, I've tried it here on a
Linux machine and searching for Chinese characters works. However, the
setting of my little experiment could be different from yours and here is
how I did it.

First, the imported files in my collection are in Simplified Chinese and
encoded with ISO-8859. Then, to process the files correctly, I set the
input_encoding of TextPlug to gb, because my files are .txt files. Also,
add "separate_cjk true" to the main.cfg file and build with the mgpp indexer.
Make sure the documents have enough words so that indexes can be built.

So I would suggest you check the encoding of the imported files and make
sure the input_encoding option of the plugin that processes the documents is
set properly: it should be the same as the encoding of the imported files.
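One rough way to check the encoding of the imported files is to try decoding
each one with a few candidate encodings and see which succeeds. The sketch
below is only an illustration and not part of Greenstone: the function names,
the candidate list and the "import" directory name are assumptions. A file
that decodes as "utf-8" would suggest the utf8 setting, and one that only
decodes as "gb18030" would suggest gb. A clean decode is a hint, not proof.

```python
from pathlib import Path
from typing import Optional

# Candidate encodings to try, in order (an assumption for this sketch).
# UTF-8 goes first because it is strict: GB-encoded Chinese text will
# normally fail to decode as UTF-8.
CANDIDATES = ("utf-8", "gb18030")


def guess_encoding(data: bytes) -> Optional[str]:
    """Return the first candidate encoding that decodes data without error."""
    for name in CANDIDATES:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def report(import_dir: str) -> None:
    """Print a guessed encoding for every .txt file under import_dir."""
    for path in sorted(Path(import_dir).rglob("*.txt")):
        print(path, guess_encoding(path.read_bytes()))


if __name__ == "__main__":
    # Hypothetical location of the collection's import folder.
    report("import")
```

Note that a file containing only plain ASCII will be reported as utf-8, since
ASCII is valid under both encodings.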

Greenstone used to have a segmentation tool, which split Chinese
characters so that search indexes could be built from them. But we were
not allowed to distribute it because it was copyrighted, and we don't have
it now. So it would be great if you or other users in Hong Kong could
contribute one as well.

Hope this helps,
Anna Huang

> Yes, I have already added
> buildtype mgpp
> separate_cjk true
> but I still cannot search for any Chinese characters.
> -----Original Message-----
> From: Anna Huang [mailto:lh92@cs.waikato.ac.nz]
> Sent: Wednesday, January 23, 2008 9:31 AM
> To: K M Ku
> Cc: greenstone-users@list.scms.waikato.ac.nz
> Subject: Re: [greenstone-users] search and index Chinese characters
> Hi,
> Yes, you should add "separate_cjk true" to the collect.cfg file. Besides,
> which indexer are you using? If it's not mgpp, please try rebuilding the
> collection with mgpp. You can change the indexer by going to the Search
> Indexes section on the Design panel, or manually change the buildtype to
> mgpp in the collect.cfg file.
> Hope this helps,
> Anna
>> I followed others' comments and entered separate_cjk true in the
>> collect.cfg file.
>> However, searching for Chinese characters is still not possible.
>> I read the wiki and found that separate_cjk is an option of import.pl. How
>> can I implement Chinese searching?
>> _______________________________________________
>> greenstone-users mailing list
>> greenstone-users@list.scms.waikato.ac.nz
>> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users