Re: [greenstone-users] problems with accent marks in words

From Tod Olson
Date Thu, 12 May 2005 15:51:49 -0500
Subject Re: [greenstone-users] problems with accent marks in words
In-Reply-To (5-1-1-6-0-20050512135041-0380a468-piluso-clacso-edu-ar)
>>>>> "F" == Flor <vergara@clacso.edu.ar> writes:

F> In Spanish and Portuguese we use a lot of accent marks in words.
F> When searching in our virtual library, our users sometimes include in their
F> search the accent marks, and sometimes they do not.
F> And in the input sometimes our members include the accent marks and
F> sometimes they do not.

F> How can we do so that Greenstone does not consider accent marks in the
F> input and in the output?

Chopin Early Editions (http://chopin.lib.uchicago.edu/) has the same
issue. The problem of accents in metadata was addressed, but not
accents in the full text.

With the metadata, we would use two forms: one for indexing and one
for display. The display version of a field has the text with
accents. The index version can have any variants you want to match.

This might be easier to illustrate first with the different spellings of
cities. Take Leipzig. There are more spellings than what's shown below:

<Metadata name="PubPlace">Leipzig</Metadata>
<Metadata name="PubPlaceIdx">Leipsic Leipzig</Metadata>

The CEE publication place index is built from PubPlaceIdx, so it
matches any spelling, but any display is built from the version that
actually appears on the printed score, stored in PubPlace.

You can apply this idea to accents, treating them as variant spellings.
There's an example of accents in the title in this previous email:

http://puka.cs.waikato.ac.nz/cgi-bin/library?a=d&c=gsarch&cl=CL2.18.11&d=20041206.104521.846947295.tao-lib.uchicago.edu

This example assumes that the user will type the query without accents,
but it could be adapted to accommodate either form of query.
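
If you have many records, the index version could be generated rather
than typed by hand. Below is a rough sketch in Python, not something
CEE actually runs: it takes the display text, strips the accents, and
writes both forms into the index field so a query matches with or
without accents. The function names and the "Title"/"TitleIdx" field
names are made up for illustration (the Idx suffix just follows the
PubPlace/PubPlaceIdx pattern above). Note that this also turns "ñ" into
"n", which may or may not be what you want.

```python
import unicodedata
from xml.sax.saxutils import escape


def strip_accents(text: str) -> str:
    """Return text with combining marks removed, e.g. 'política' -> 'politica'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_value(display: str) -> str:
    """Join the accented and unaccented forms, skipping duplicates."""
    variants = []
    for form in (display, strip_accents(display)):
        if form not in variants:
            variants.append(form)
    return " ".join(variants)


def metadata_pair(name: str, display: str) -> str:
    """Build a display element and an index element (hypothetical helper)."""
    return (
        f'<Metadata name="{name}">{escape(display)}</Metadata>\n'
        f'<Metadata name="{name}Idx">{escape(index_value(display))}</Metadata>'
    )


# "Title" is an assumed field name; substitute whatever your collection uses.
print(metadata_pair("Title", "Educación y política"))
```

The index would then be built from TitleIdx and the display from Title,
the same way as with the publication place.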

I never figured out how to do this for full-text searching.

-Tod