Re: [greenstone-users] accents in metadata

From Tod Olson
Date Mon, 06 Dec 2004 10:45:21 -0600 (CST)
Subject Re: [greenstone-users] accents in metadata
In-Reply-To (fc-00664147002aa82000664147002aa820-2aa82b-campus-clacso-edu-ar)
>>>>> "DB" == Dominique Babini <dbabini@campus.clacso.edu.ar> writes:

DB> What change do we have to make so that our users can search a word
DB> indistinctly with or without accents in the metadata fields
DB> (spanish has a lot of accents!!!!).

We had the same problem in Chopin Early Editions
(http://chopin.lib.uchicago.edu/): lots of diacritics, but lots of
users with US keyboards. The problem was addressed by processing the
metadata in the GSAF files. Not ideal, but it required no
modifications to the indexing engine.

The idea is that from any metadata field, you can create a parallel
version of the field that is modified to suit your search criteria.
The modified version of the metadata is used for indexing, and the
original is used for display. Here's some real title metadata:

<Metadata name="Title">Prélude en ut dièse mineur, op. 45</Metadata>

After the GSAF files are created, we run them through a filter that
looks for certain metadata fields, like Title, and for each one
creates a new field with "Idx" appended to the name and all the
diacritics stripped out:

<Metadata name="TitleIdx">Prelude en ut diese mineur, op. 45</Metadata>

So if you title search the collection for the string "Prelude en ut
diese mineur", the engine will match on the TitleIdx field above, but
the search results format string displays the Title field.
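
A minimal Python sketch of such a filter follows. It is not the script
used for Chopin Early Editions. It assumes each metadata element sits on
one line in exactly the form shown above, and the INDEXED_FIELDS list
and the function names are hypothetical:

```python
import re
import unicodedata

# Hypothetical list of fields to mirror; adjust to your collection.
INDEXED_FIELDS = {"Title"}

# Assumes one <Metadata> element per line, written as in the examples above.
METADATA_RE = re.compile(r'<Metadata name="([^"]+)">(.*?)</Metadata>')


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def add_index_fields(lines):
    """Yield each line, followed by a stripped "<name>Idx" copy where relevant."""
    for line in lines:
        yield line
        match = METADATA_RE.search(line)
        if match and match.group(1) in INDEXED_FIELDS:
            name, value = match.groups()
            yield '<Metadata name="%sIdx">%s</Metadata>' % (
                name, strip_diacritics(value))
```

Letters that have no Unicode decomposition (for example "ø" or "ß")
pass through this sketch unchanged, so they would need an explicit
replacement table.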

In your case, you might consider putting both forms of the Title, with
and without diacritics, into the TitleIdx:

<Metadata name="TitleIdx">Prélude en ut dièse mineur, op. 45 Prelude en ut diese mineur, op. 45</Metadata>
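
One way to build that combined value, sketched in Python. The function
names are hypothetical, and the stripped form is appended only when it
differs from the original, which is a design choice of this sketch and
not something Greenstone requires:

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_value(text):
    """Return the original text plus its unaccented form, if they differ."""
    plain = strip_diacritics(text)
    if plain == text:
        return text
    return text + " " + plain
```

With both forms in the indexed field, a search typed with accents and
one typed without them can both match the same record.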

This idea adapts pretty broadly to various forms of metadata. For
example, in the same collection, someone searching for scores
published in London should also match Londres, so we do something
similar for the place of publication, providing all spellings of the
city that occur in the collection.
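
A rough Python sketch of the place-name case. The variant table here is
hypothetical and holds only the London/Londres pair mentioned above; a
real table would have to be compiled from the collection itself:

```python
# Hypothetical table: each set groups the spellings of one city.
PLACE_VARIANTS = [
    {"London", "Londres"},
]


def place_index_value(place):
    """Return every known spelling of the place, space-separated."""
    for group in PLACE_VARIANTS:
        if place in group:
            # Sort so the output is stable from run to run.
            return " ".join(sorted(group))
    return place
```

The returned string would go into the parallel index field for the
place of publication, while the original field is kept for display.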

There are two flaws in this approach: it may mess up the relevance
rankings, and it does not work for the full-text index.


Tod A. Olson <tod@uchicago.edu>
Sr. Programmer / Analyst
The University of Chicago Library

"How do you know I'm mad?" said Alice.
"If you weren't mad, you wouldn't have come here," said the Cat.