[greenstone-users] Language support

From Anna Huang of Greenstone Team
Date Wed Jul 9 17:28:02 2008
Subject [greenstone-users] Language support
In-Reply-To (486480B5-4030507-realss-com)
Hi Wolfgang,

Greenstone does support Chinese word segmentation and searching. As
Katherine said in her reply in 2004 (copied below), if you are using
Greenstone 2.50 or later, you can add an option to the collection config
file (gsdl/collect/<collname>/etc/collect.cfg): add the line

separate_cjk true

This will enable segmentation when Greenstone parses a document. Please
note that although the option has cjk in its name, it only works for
Chinese characters at the moment.
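
For context, the relevant part of a collect.cfg file might then look
something like the sketch below. The other lines are only illustrative
placeholders for entries a typical collection config already contains;
the separate_cjk line is the only one you need to add:

    creator        greenstone@cs.waikato.ac.nz
    maintainer     greenstone@cs.waikato.ac.nz
    public         true

    separate_cjk   true

    indexes        document:text document:Title
    defaultindex   document:text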

Our implementation is rather simple: we detect Chinese characters in a
document and automatically insert a special symbol (a zero width space
character) right before and after each Chinese character to delineate
its boundaries. If you and your colleague have any open-source tool for
Chinese word segmentation, could you please send it to me? We would love
to have a more sophisticated implementation of Chinese word
segmentation. We used to have such a package, but it is no longer
available due to copyright issues.
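
To make the idea concrete, here is a minimal sketch in Python of the
approach described above. It is not the actual Greenstone code, and the
Unicode range it checks (the CJK Unified Ideographs block) is an
assumption about what counts as a Chinese character:

    ZWSP = "\u200b"  # zero width space

    def is_chinese(ch):
        # CJK Unified Ideographs block; a simplification of what
        # counts as a "Chinese character" in practice.
        return "\u4e00" <= ch <= "\u9fff"

    def separate_cjk(text):
        # Insert the zero width space before and after each Chinese
        # character to mark its boundaries for the indexer.
        out = []
        for ch in text:
            out.append(ZWSP + ch + ZWSP if is_chinese(ch) else ch)
        return "".join(out)

    print(repr(separate_cjk("Greenstone 数字图书馆")))

The doubled zero width spaces between adjacent Chinese characters are
harmless for this illustration, since the symbol only serves as a token
boundary.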

Once the collection is successfully built with the "separate_cjk"
option, you should be able to search for Chinese words and single
characters. As a working example, please check our Chinese demo
collection at
http://www.nzdl.org/cgi-bin/library?a=p&p=about&c=chinese&l=zh&nw=utf-8.
It is built with the "separate_cjk" option and supports searching for
Chinese words and characters. Use quotes to search for an exact phrase,
for example "??".

We have also recently made the segmentation function available for the
metadata files, so that Chinese metadata values specified in the
metadata.xml files can also be segmented and searched. This will be
available in the next Greenstone release.
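
Purely as an illustration (please check the layout against the
metadata.xml files your collection already has), such an entry with a
Chinese title might look roughly like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <DirectoryMetadata>
      <FileSet>
        <FileName>example\.html</FileName>
        <Description>
          <Metadata name="dc.Title">数字图书馆简介</Metadata>
        </Description>
      </FileSet>
    </DirectoryMetadata>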

We are always keen to promote Greenstone in China, and we are eager to
enrich our example list (at http://www.greenstone.org/examples) with
more Chinese collections. In addition, we have completed the Chinese
interface in Greenstone version 2.80. This includes both the User
Interface and the Greenstone Librarian Interface, so Greenstone users
can now build and view collections entirely in a Chinese language
environment. We hope this helps to promote Greenstone in China. You are
warmly welcome to test the Chinese interface in version 2.80, and any
comments or suggestions are appreciated.

Cheers,
Anna Huang

Katherine Don wrote:

> >> Hi Wolfgang
> >>
> >> If you are using Greenstone 2.50 or later, you can add an option to
> >> the collection config file (gsdl/collect/<collname>/etc/collect.cfg):
> >> add in the line
> >>
> >> separate_cjk true
> >>
> >> This will separate Chinese characters by spaces for indexing and
> >> querying. Unfortunately, this option has not made it into the GLI yet,
> >> so if you are using the GLI to build your collection, you will need to
> >> close the collection, add the option in to the config file by hand,
> >> then reopen the collection in GLI. You won't be able to see the option,
> >> but it will be there.
> >>
> >> (Note that while the option has cjk in its name, it only works for
> >> Chinese characters at the moment.)
> >>
> >> Regards,
> >> Katherine Don
> >>
> >> Wolfgang Scheuing wrote:
>
>> >>> Dear all,
>> >>>
>> >>> I want to use Greenstone to search for content in my documents,
>> >>> which are in German and Chinese. I am not very familiar with
>> >>> Greenstone. Searching for German content is no problem. Searching
>> >>> for Chinese content is difficult, because in some texts there are
>> >>> no spaces between Chinese characters and Greenstone takes the whole
>> >>> paragraph as a "word". How can I "force" Greenstone to index single
>> >>> Chinese characters, even if there is no spacing?
>> >>>
>> >>> If there is some information about this topic, please give me the
>> >>> link so that I can read up on it by myself.
>> >>>
>> >>> Thanks in advance!
>> >>>
>> >>> Wolfgang
>> >>>
>>
> >>
> >>


Wolfgang Scheuing wrote:
> Hello!
>
> I have been interested in this project for 4 years now, but I am still
> concerned about the language support. Now I once more have an inquiry
> from a Chinese-language university. They want to collect articles in
> English, Russian, Japanese, Arabic, Spanish, German and probably Chinese.
>
> AFAIK the western languages are no problem. I haven't tested Arabic,
> Japanese and Russian so far. For 4 years I have hoped to see Chinese
> working properly, but it doesn't. There are many "examples", but NONE of
> the ones I have tested works correctly. I don't think they are all fake,
> but can anyone explain to me how they manage to find a Chinese word in a
> document?
>
> I asked this question 4 years ago (you can see it on the mailing list)
> and a former colleague of mine solved the problem (or so I thought at
> the time). But if I check the current examples it still doesn't work.
> Maybe Greenstone didn't accept his patch. I don't know.
>
> Greenstone could be such a fantastic piece of software if it were
> usable, but if I cannot find a single Chinese word it isn't usable to
> me, nor to the customers. I have almost given up pushing such projects
> in China because the software doesn't work. As we have also been
> rejected as a Greenstone partner, I guess Greenstone is not interested
> in becoming more popular in China. As I said before, we have tried for
> 4 years now and I guess this is our last attempt.
>
> Maybe I am wrong. Simply tell me just 1 working collection with Chinese
> and it will completely change my mind.
>
> Best regards
>
> Wolfgang
>
>
> _______________________________________________
> greenstone-users mailing list
> greenstone-users@list.scms.waikato.ac.nz
> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users
>
>