Re: [greenstone-users] problems with accent marks in words

From Katherine Don
Date Mon, 16 May 2005 16:01:57 +1200
Subject Re: [greenstone-users] problems with accent marks in words
In-Reply-To (5-1-1-6-0-20050513083732-03304ec8-piluso-clacso-edu-ar)
Hi

An alternative way you could achieve this is to implement a stemmer.

Standard stemming reduces words to their root, e.g. computer, computing,
and computation might all stem to comput. The index stores the stemmed
version. When a query is run, the search terms are stemmed too, so they
match all variants.

You could implement a stemmer that just removes the accents, so that
words with accents are mapped to their unaccented forms in the index.
This would need to be written in C/C++. Greenstone already has two
stemmers (English and a simple French one), and there is a mechanism to
choose which one to use.
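
As a rough illustration of the accent-removing idea, here is a minimal
standalone sketch. It is not Greenstone's stemmer interface: the function
name fold_accents is hypothetical, and it assumes the terms are UTF-8 and
that only the accented letters in the range U+00C0 to U+00FF matter.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT part of Greenstone: fold accented letters in a
// UTF-8 string to their unaccented ASCII base letters.
// Assumption: only code points U+00C0..U+00FF are handled. In UTF-8 these
// are the two-byte sequences 0xC3 0x80 .. 0xC3 0xBF.
std::string fold_accents(const std::string& utf8) {
    // One entry per code point U+00C0..U+00FF.
    // '.' means "no mapping, copy the original bytes unchanged".
    static const char table[] =
        "AAAAAA.CEEEEIIII.NOOOOO..UUUUY.."
        "aaaaaa.ceeeeiiii.nooooo..uuuuy.y";

    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead == 0xC3 && i + 1 < utf8.size()) {
            unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            if (next >= 0x80 && next <= 0xBF && table[next - 0x80] != '.') {
                out.push_back(table[next - 0x80]);
                ++i;  // skip the continuation byte we just consumed
                continue;
            }
        }
        out.push_back(utf8[i]);  // ASCII or unmapped byte: copy as-is
    }
    return out;
}
```

If the same function is applied to terms at indexing time and to search
terms at query time, "educación" and "educacion" both become "educacion"
and so match each other. Whether to fold letters such as ñ to n is a
policy choice; change the table entry to '.' to leave one alone.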

If you are interested in doing this, I can give you more information
about how to add a new stemmer to Greenstone.

Regards,
Katherine Don

Flor wrote:
> Thank you Tod for your help. I am going to make a parallel version of the
> field.
>
> Regards
>
> Florencia Vergara Rossi
> Biblioteca - Clacso
> vergara@clacso.edu.ar
> http://www.clacso.org.ar/biblioteca
>
> At 15:51 12/05/2005 -0500, you wrote:
>
>> >>>>> "F" == Flor <vergara@clacso.edu.ar> writes:
>>
>> F> In Spanish and Portuguese we use a lot of accent marks in words.
>> F> When searching in our virtual library, our users sometimes include
>> F> the accent marks in their search, and sometimes they do not.
>> F> And when entering data, our members sometimes include the accent
>> F> marks and sometimes they do not.
>>
>> F> How can we make Greenstone ignore accent marks in the input and in
>> F> the output?
>>
>> Chopin Early Editions (http://chopin.lib.uchicago.edu/) has the same
>> issue. The problem of accents in metadata was addressed, but not
>> accents in the full text.
>>
>> With the metadata, we would use two forms: one for indexing and one
>> for display. The display version of a field has the text with
>> accents. The index version can have any variants you want to match.
>>
>> This might be easier to illustrate first with the different spellings of
>> cities. Take Leipzig. There are more spellings than what's shown below:
>>
>> <Metadata name="PubPlace">Leipzig</Metadata>
>> <Metadata name="PubPlaceIdx">Leipsic Leipzig</Metadata>
>>
>> The CEE publication place index is built from PubPlaceIdx, so it
>> matches any spelling, but any display is built from the version that
>> actually appears on the printed score, stored in PubPlace.
>>
>> You can apply this idea to accents, treating them as variant spellings.
>> There's an example of accents in the title in this previous email:
>>
>> http://puka.cs.waikato.ac.nz/cgi-bin/library?a=d&c=gsarch&cl=CL2.18.11&d=20041206.104521.846947295.tao-lib.uchicago.edu
>>
>>
>> This example assumes that the user will type the query without accents,
>> but it could be adapted to accommodate either form of query.
>>
>> I never figured out how to do this for full-text searching.
>>
>> -Tod