Re: Searching Problem

From Stefan Boddie
Date Sun, 06 Apr 2003 21:06:28 +1200
Subject Re: Searching Problem
In-Reply-To (1049521851-3e8e6ebb2313a-IMP-Lehigh-EDU)
Regarding version information, there should be a file called
gsdl/etc/VERSION that tells you the Greenstone version
number. If you don't have this file then you've probably got
a rather old version. I'd recommend upgrading to gsdl-2.39
if at all possible, though you'll need to ask John what
changes he made to your installation and if they could be
easily moved to an updated installation.
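
If it helps, the check can be scripted. The following is only a minimal Python sketch: the function name and the "gsdl" default path are made up for the example, and it assumes nothing about the layout of the VERSION file beyond it being plain text.

```python
import os


def read_greenstone_version(gsdl_home="gsdl"):
    """Return the contents of <gsdl_home>/etc/VERSION, or None if it is missing.

    gsdl_home is a hypothetical parameter: point it at your own installation.
    """
    version_path = os.path.join(gsdl_home, "etc", "VERSION")
    if not os.path.isfile(version_path):
        # No VERSION file: probably a rather old installation.
        return None
    with open(version_path, encoding="utf-8", errors="replace") as handle:
        return handle.read().strip()


if __name__ == "__main__":
    version = read_greenstone_version("gsdl")
    print(version if version is not None else "no gsdl/etc/VERSION file found")
```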

Regarding the searching problem you mentioned, there are two
possibilities I can think of, neither of which really
explains it well:

1. The Chinese text is being segmented and indexed as single
characters, as discussed below. This shouldn't have any
effect on words like "A119" though, as it contains no Chinese
characters. In gsdl/perllib/plugins/ there's a
line something like the following:
$$textref = &cnseg::segment($$textref);
If you comment out or remove that line you'll prevent
Greenstone from ever attempting to segment the input text.

2. MG splits numbers up into pieces of at most 4 digits to
prevent its dictionary from becoming too large. For
example, if your source document contained the token 1234567,
MG would index it as 1234 567. Searching for 1234567 would
retrieve no hits, while searching for 1234 or 567 would. This
also shouldn't affect "A119" though, as it has fewer than four
digits. This problem makes searching on many ID-number-type
fields a little difficult. I'm working on a solution to this
and it should be in CVS soon.
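
As a rough illustration of why neither possibility should affect "A119", here is a minimal Python sketch of the two behaviours described above. It is not Greenstone's or MG's actual code: the function names, the range used to detect Chinese characters (U+4E00 to U+9FFF), and the tokenising regular expression are all assumptions made for the example.

```python
import re

MAX_DIGITS = 4  # digit limit described above; a parameter of this sketch


def segment_chinese(text):
    """Put spaces around each CJK ideograph so each one becomes its own word.

    Assumption: "Chinese character" means U+4E00..U+9FFF only.
    """
    out = []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)


def split_digits(token, max_digits=MAX_DIGITS):
    """Split a token so that no piece contains more than max_digits digits."""
    pieces, current, digits = [], "", 0
    for ch in token:
        if ch.isdigit() and digits == max_digits:
            # The digit budget is used up: close this piece and start another.
            pieces.append(current)
            current, digits = "", 0
        current += ch
        if ch.isdigit():
            digits += 1
    if current:
        pieces.append(current)
    return pieces


def index_terms(text):
    """Return the terms a simplified indexer would store for text."""
    terms = []
    for token in segment_chinese(text).split():
        for word in re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]", token):
            terms.extend(split_digits(word))
    return terms


if __name__ == "__main__":
    print(index_terms("A119 中文 1234567"))
    # ['A119', '中', '文', '1234', '567']
```

In this model a search for 1234567 finds nothing while 1234 and 567 both match, and "A119" is stored as a single term either way.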

Stefan.

Min Zhang wrote:
> Hi, All,
> I am working on a project of Ancient Chinese Names and Vessels, which had a
> trial run in late 2001 by a couple of developers. Unfortunately, we have since
> lost contact with them.
> I have the same searching problem as the one posted by one of the developers,
> Hongyan (see the message below, dated Mon, 10 Sep 2001). Actually, the problem
> is not limited to Chinese: for example, "A119" (an ID number) cannot be found
> even though "A119" does exist. And I think we are using the "" Stefan
> provided in the message.
> Our source documents are encoded as UTF-8. We cannot use simplified Chinese
> (GB).
> I am not sure which version of Greenstone we are using. Where can I check
> the version information? According to our system administrator, John McPherson
> has helped to update part of Greenstone (Digital Bridges Project at Lehigh
> University).
> Please give us your advice and suggestions.
> Thanks,
> Min Zhang
> -----------------------------------------------------------------------------
>>2: When I do a Chinese search and I want to search for a whole word in Chinese
>>such as "ZhongWen", it always tells me that there is no "ZhongWen" in my
>>documents, while indeed I have it. But when I separate "ZhongWen" in my original
>>collections into "Zhong" "Wen" with a space in between, the search works. It
>>seems to me that it cannot separate "ZhongWen" in my original documents
>>automatically into two characters; I have to use a space to separate them. But
>>the Chinese demo on The New Zealand Digital Library works perfectly without any
>>space in the original documents. So I wonder what I did wrong with my
> I suspect this happened because you built your collection from
> documents encoded as Big5 or UTF-8. Is that right?
> The problem is that Chinese text segmentation isn't currently fully
> supported by Greenstone. At the moment we do simple segmentation of GB
> encoded text (i.e. each character is treated as a word) but don't do any
> segmentation for Chinese text encoded as UTF-8 or Big5.
> There's no technical reason that Big5 and UTF-8 can't be treated exactly
> the same as GB. This bug is simply an oversight because GB was once the
> only form of Chinese Greenstone supported and because I've left the
> Chinese support alone for the time being (we hope to add "proper"
> Chinese text segmentation in the near future).
> To make it work as you'd expect you should download
> to replace the one in your
> gsdl/perllib/plugins directory then rebuild the collection.
> Please let me know if this doesn't solve the problem.
> regards,
> Stefan.