Searching Problem

From: miz2@lehigh.edu
Date: Sat, 5 Apr 2003 00:50:51 -0500
Subject: Searching Problem
Hi, All,

I am working on a project on Ancient Chinese Names and Vessels, which had a
trial run in late 2001 by a couple of developers. Unfortunately, we have since
lost contact with them.

I have the same searching problem as the one posted by one of the developers,
Hongyan (see the message below, dated Mon, 10 Sep 2001). Actually, the problem
is not limited to Chinese: for example, "A119" (an ID number) cannot be found
by searching even though "A119" does exist. I think we are using the
"BasPlug.pm" that Stefan provided in that message.

Our source documents are encoded as UTF-8. We cannot use simplified Chinese
(GB).

I am not sure which version of Greenstone we are using. Where can I check
the version information? According to our system administrator, John McPherson
has helped to update part of Greenstone (Digital Bridges Project at Lehigh
University).

Any advice or suggestions would be appreciated.

Thanks,

Min Zhang


-----------------------------------------------------------------------------

> 2: When I do a Chinese search and want to search for a whole word in Chinese
> such as "ZhongWen", it always tells me that there is no "ZhongWen" in my
> documents, although it is indeed there. But when I separate "ZhongWen" in my
> original collections into "Zhong" "Wen" with a space in between, the search
> works. It seems that it cannot automatically separate "ZhongWen" in my
> original documents into two characters, so I have to use a space to separate
> them. But the Chinese demo on The New Zealand Digital Library works perfectly
> without any space in the original documents. So I wonder what I did wrong
> with my setup.

I suspect this happened because you built your collection from
documents encoded as Big5 or UTF-8. Is that right?

The problem is that Chinese text segmentation isn't currently fully
supported by Greenstone. At the moment we do simple segmentation of GB
encoded text (i.e. each character is treated as a word) but don't do any
segmentation for Chinese text encoded as UTF-8 or Big5.

There's no technical reason that Big5 and UTF-8 can't be treated exactly
the same as GB. This bug is simply an oversight because GB was once the
only form of Chinese Greenstone supported and because I've left the
Chinese support alone for the time being (we hope to add "proper"
Chinese text segmentation in the near future).
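
As a rough illustration of the "each character is treated as a word"
approach, here is a minimal sketch in Python. It is not the code in
BasPlug.pm; the function names and the choice of Unicode ranges are
assumptions made only for this example. The idea is to decode the bytes to
Unicode first, after which GB, Big5 and UTF-8 input can all be handled the
same way, and then to put a space around every Chinese character so that the
indexer sees each one as a separate word.

```python
def is_cjk(ch: str) -> bool:
    """Return True for characters in the main CJK ideograph blocks (assumed ranges)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0xF900 <= cp <= 0xFAFF   # CJK Compatibility Ideographs
    )


def segment_cjk(text: str) -> str:
    """Surround every CJK character with spaces so each one indexes as a word."""
    pieces = []
    for ch in text:
        pieces.append(f" {ch} " if is_cjk(ch) else ch)
    # Collapse the extra whitespace introduced above (this also flattens newlines).
    return " ".join("".join(pieces).split())


def segment_bytes(raw: bytes, encoding: str) -> str:
    """Decode to Unicode first; after that the source encoding no longer matters."""
    return segment_cjk(raw.decode(encoding))


if __name__ == "__main__":
    for enc in ("gb2312", "big5", "utf-8"):
        print(enc, segment_bytes("A119 中文".encode(enc), enc))
```

Non-Chinese tokens such as "A119" pass through unchanged, so this sketch
only addresses the segmentation half of the problem.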

To make it work as you'd expect you should download
ftp://nzdl.org/pub/gsdl/tmp/BasPlug.pm to replace the one in your
gsdl/perllib/plugins directory then rebuild the collection.

Please let me know if this doesn't solve the problem.

regards,
Stefan.

