[greenstone-users] search and index Chinese characters

From Anna Huang
Date Mon Mar 17 14:59:57 2008
Subject [greenstone-users] search and index Chinese characters
In-Reply-To (001e01c8858a$04766c30$0d634490$-hk)
Hi Mr Ku,

I've figured out why Chinese character searching doesn't work for you. In
your case the Chinese characters come from the metadata values in the
metadata.xml files, but at the moment Greenstone only does character
segmentation on document text.

To enable segmentation for metadata values, you need to edit
MetadataXMLPlug.pm (in $GSDLHOME/perllib/plugins). At line 296, add the
following code:

if ($self->{'separate_cjk'}) {
    # segment the Chinese words
    $mvalue = &cnseg::segment($mvalue);
}
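
As a rough illustration of what segmentation does to a metadata value, the
sketch below puts a space around each Chinese character so that the indexer
treats every character as a separate word. It is not the real
cnseg::segment; segment_cjk is a hypothetical stand-in, it assumes the
value is already a decoded Perl character string, and it only covers the
basic CJK Unified Ideographs range (U+4E00 to U+9FFF).

```perl
use strict;
use warnings;

# Hypothetical stand-in for cnseg::segment (an assumption, not Greenstone code).
# Expects a decoded character string, not raw UTF-8 bytes.
sub segment_cjk {
    my ($text) = @_;

    # Surround every character in U+4E00..U+9FFF with spaces.
    $text =~ s/([\x{4E00}-\x{9FFF}])/ $1 /g;

    # Collapse runs of spaces and trim the ends. Note that this also
    # normalises doubled spaces in any non-Chinese text in the value.
    $text =~ s/ {2,}/ /g;
    $text =~ s/^ | $//g;

    return $text;
}

# Two Chinese characters followed by Latin text.
my $segmented = segment_cjk("\x{4E2D}\x{6587}abc");
```

After the call, $segmented holds the two Chinese characters as separate
space-delimited tokens followed by "abc", so each character can be matched
on its own at search time.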

Since the metadata files are already in UTF-8, you don't need to set the
plugins' input_encoding and default_encoding options. UTF-8 is the default
encoding, so just remove these two options from the configuration file.

Also, "separate_cjk true" should be added to the collect.cfg file, not
main.cfg, because collect.cfg is the file used when building a
collection.
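
For reference, the relevant lines in collect.cfg would look roughly like
this. It is only a fragment: the rest of the file (plugin lines, indexes
and so on) depends on the collection and is not shown.

```
buildtype     mgpp
separate_cjk  true
```

The collection needs to be rebuilt after the change for it to take effect.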

Regards,
Anna Huang


> No, it's not working.
>
> The encoding is utf8, and I selected 'utf8' for TextPlug.
> Can I send you the data for your testing?
>
> -----Original Message-----
> From: Anna Huang [mailto:lh92@cs.waikato.ac.nz]
> Sent: Thursday, March 13, 2008 11:26 AM
> To: K M Ku
> Cc: greenstone-users@list.scms.waikato.ac.nz
> Subject: RE: [greenstone-users] search and index Chinese characters
>
> Hi Mr Ku,
>
> Have you solved this problem? I hope so. Anyway, I've tried it here on a
> Linux machine and searching for Chinese characters works. However, my
> setup may differ from yours, so here is how I did it.
>
> First, the imported files in my collection are in Simplified Chinese and
> encoded with ISO-8859. To process them correctly, I set the
> input_encoding of TextPlug to gb, because my files are .txt files. I also
> added "separate_cjk true" to the main.cfg file and built with the mgpp
> indexer. Make sure the documents have enough words so that indexes can be
> built properly.
>
> So I would suggest you check the encoding of the imported files and make
> sure the input_encoding option of the plugin that processes the documents
> is set properly: it should match the encoding of the imported files.
>
> Greenstone used to have a segmentation tool that split Chinese text into
> individual characters so that search indexes could be built on
> characters. But we were not allowed to distribute it because it was
> copyrighted, and we don't have it now. So it would be great if you or
> other users in Hong Kong could contribute one.
>
> Hope this helps,
> Anna Huang
>
>> Yes, I have already added
>>
>> buildtype mgpp
>> separate_cjk true
>>
>> but I am still not able to search for any Chinese characters.
>>
>>
>> -----Original Message-----
>> From: Anna Huang [mailto:lh92@cs.waikato.ac.nz]
>> Sent: Wednesday, January 23, 2008 9:31 AM
>> To: K M Ku
>> Cc: greenstone-users@list.scms.waikato.ac.nz
>> Subject: Re: [greenstone-users] search and index Chinese characters
>>
>> Hi,
>>
>> Yes, you should add "separate_cjk true" to the collect.cfg file. Also,
>> which indexer are you using? If it's not mgpp, please try rebuilding the
>> collection with mgpp. You can change the indexer by going to the Search
>> Indexes section on the Design panel, or by manually changing the
>> buildtype to mgpp in the collect.cfg file.
>>
>> Hope this helps,
>> Anna
>>
>>
>>> I followed others' comments and entered "separate_cjk true" in the
>>> collect.cfg file.
>>>
>>> However, searching for Chinese characters is still not possible.
>>>
>>> I read the wiki and found that separate_cjk is an option of import.pl.
>>> How can I implement Chinese searching?
>>>
>>> _______________________________________________
>>> greenstone-users mailing list
>>> greenstone-users@list.scms.waikato.ac.nz
>>> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users