Re: [greenstone-users] a problem about chinese character search.

From Katherine Don
Date Mon, 02 Aug 2004 09:01:08 +1200
Subject Re: [greenstone-users] a problem about chinese character search.
In-Reply-To (410AFB36-1040102-realss-com)

You can search Chinese content whether or not you have separate_cjk in
your config file.
The problem with Chinese is that it does not have spaces in it, so the
indexer does not know where to put word boundaries. The indexer will
separate the text into words at spaces or punctuation, meaning whole
sentences may be treated as one word. This makes it very hard to find
anything. Word segmentation is a complex problem, and we provide a
simple solution - treat every character as a word. This is what happens
if you have the option 'separate_cjk true' in your config file.
Each character is indexed separately, and when you do a query, the query
text is separated into characters for looking up in the index.
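
To illustrate the idea (this is only a sketch in Python, not the actual
Greenstone indexer code), the same splitting is applied to both the
document text and the query text. The function names and the check for
the basic CJK Unified Ideographs block are assumptions made for the
example.

```python
def is_cjk(ch):
    """Rough check (assumption): basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text):
    """Split at spaces/punctuation, but emit every CJK character as its own word."""
    tokens = []
    buf = []  # collects a run of non-CJK letters or digits
    for ch in text:
        if is_cjk(ch):
            if buf:
                tokens.append("".join(buf))
                buf = []
            tokens.append(ch)  # each Chinese character becomes one "word"
        elif ch.isalnum():
            buf.append(ch)
        else:
            # space or punctuation: ends the current non-CJK word
            if buf:
                tokens.append("".join(buf))
                buf = []
    if buf:
        tokens.append("".join(buf))
    return tokens


# The same function is used when indexing and when querying.
doc_tokens = tokenize("Greenstone 中文检索")
query_tokens = tokenize("检索")
```

Here doc_tokens contains "Greenstone" followed by the four separate
characters, and query_tokens contains the two characters of the query.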

You will get all the matching documents, plus some false matches
(documents that contain the query characters but not as the word you
meant). We think this is a better solution than getting no matches,
which is what is likely to happen if we don't do anything to the input
text.
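
A small Python sketch of where the false matches come from. It assumes
a query matches a document when the document contains every character
of the query, with no check on their order or position. That is a
simplification for illustration and may not be exactly how the
Greenstone search engine combines the characters. The sample documents
and function names are made up.

```python
from collections import defaultdict


def build_char_index(docs):
    """Map each character to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for ch in text:
            if not ch.isspace():
                index[ch].add(doc_id)
    return index


def search_chars(index, query):
    """Return ids of documents containing every character of the query."""
    chars = [ch for ch in query if not ch.isspace()]
    if not chars:
        return set()
    result = set(index.get(chars[0], set()))
    for ch in chars[1:]:
        result &= index.get(ch, set())
    return result


# Hypothetical sample collection.
docs = {
    1: "我爱中国",   # contains the word 中国
    2: "国中学生",   # contains both characters, but not the word 中国
}
char_index = build_char_index(docs)
hits = search_chars(char_index, "中国")
```

With these two documents, hits contains both ids: document 1 is the
real match and document 2 is a false match.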

One of our PhD students implemented proper Chinese segmentation using
text mining techniques. This is not part of the standard Greenstone
distribution and I'm not sure where to get the code or how to put it in.
If you want this please email the list with this specific request.

Alternatively, if you pre-segment your source documents into words, then
you can build the collection without the separate_cjk option, and it
will do proper word searching as long as the query is segmented too.
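
As a rough Python sketch of this alternative, assuming the documents
have already been segmented by some external tool so that words are
separated by spaces (the sample data and function names are made up):

```python
from collections import defaultdict


def build_word_index(docs):
    """Index pre-segmented text: words are whatever lies between spaces."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def search_words(index, segmented_query):
    """Return ids of documents containing every word of the segmented query."""
    words = segmented_query.split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, already segmented into words with spaces.
segmented_docs = {
    1: "我 爱 中国",
    2: "国中 学生",
}
word_index = build_word_index(segmented_docs)

good = search_words(word_index, "爱 中国")   # query segmented the same way
bad = search_words(word_index, "爱中国")     # unsegmented query
```

Here good contains only document 1, with no false match on document 2,
while bad is empty because the unsegmented query is looked up as a
single word that is not in the index.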

Katherine Don

tangkai wrote:
> Dear all,
> I am a normal user of Greenstone.
> I think there is a problem with Chinese word search in Greenstone.
> I added "separate_cjk true" to the collection config file
> (gsdl/collect/<collname>/etc/collect.cfg).
> Then I can search the Chinese content, but when I type in a Chinese
> word to search for,
> the search engine treats the word as separate characters.
> So I cannot get an accurate result when searching.
> Is there some way to solve this problem?
> If there is some information about this topic, please give me the link
> so that I can read about it myself.
> Thank you very much.
> tangkai.