Re: [greenstone-users] unicode character/letter synonyms in mg

From Michael Dewsnip
Date Thu, 02 Oct 2003 16:58:48 +1200
Subject Re: [greenstone-users] unicode character/letter synonyms in mg
In-Reply-To (OFAF752C48-07325474-ON69256DB2-0025AD86-69256DB2-0025ADA7-ntu-edu-au)
Hi Stephen,

> Hi, I hope this isn't a silly question....

There's no such thing as a silly question... just silly answers :-)

> ...how does mg (and mg++) deal with searching for accented characters?

> I am working with a Portuguese language collection and I have noticed a
> number of words are spelt with and without accents on particular
> characters.
> Sometimes there are even different accents, which are different to the
> Portuguese language spell-check dictionary that I use. (Omnipage Pro 12)
>
> I am assuming I am dealing with the same word, which may not always be the
> case. Many times I feel the unaccented version was used due to the printing
> technology available to the publisher; sometimes they did not have the
> right accent available.
>
> I am unable to find any literature on the subject (but I may be searching
> with the wrong terms. hahaha).

My understanding of your question is that you have two words, both the same,
except that one has an accent where the other doesn't. You want them to be
treated the same for searching purposes.

I don't know too much about this, and feel out of my league here (in other
words, no doubt I'll get it wrong), but it seems like a near-impossible task
for a computer program to do in general. And therefore, I doubt that mg or
mgpp do anything like it. Two words containing different characters will be
treated as different words (except if something like case-folding is done).

One difficulty with your problem would be determining that two characters are
the same (barring an accent). I think you would end up having to create
hard-wired, encoding-specific tables to show which characters are "the same".
More importantly, I think it would be a feature that would be unhelpful some
of the time (or even a lot of the time). It is likely that in many cases two
words that are similar in this way are actually two different words (as you
point out), and *shouldn't* be treated the same.
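
To make the "table" idea a bit more concrete, here is a minimal sketch in
Python. It is not something mg or mgpp does. Instead of a hand-built,
encoding-specific table it assumes the text is already Unicode and relies on
Unicode decomposition to separate accents from their base letters; the name
fold_accents is made up for illustration. It also shows the drawback: two
genuinely different Portuguese words end up with the same key.

```python
import unicodedata

def fold_accents(word):
    """Return word with combining accent marks removed (hypothetical helper)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop the combining marks and keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in one consistent form.
    return unicodedata.normalize("NFC", stripped)

print(fold_accents("informação"))                  # informacao
# "avó" (grandmother) and "avô" (grandfather) are different words,
# but they fold to the same string:
print(fold_accents("avó") == fold_accents("avô"))  # True
```

Anything built on this would have to fold both the indexed text and the query
terms, and would lose the distinction between words like the two above.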

I'd like to end with something positive, but it seems to me that it would be
easier to fix all the documents manually rather than try to compensate for the
problem later on! If only a fairly limited number of words are causing
problems you could perhaps do something special at build-time, but I don't
think there is a general solution to this problem.
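
For the limited-word-list case, one possible shape for "something special" is
a small script run over the document text before the collection is built.
This is only a sketch under assumptions: the VARIANTS table and the
normalise_variants function are hypothetical and hand-maintained, nothing here
is part of Greenstone, and the text is assumed to be Unicode with precomposed
accented letters.

```python
import re

# Hypothetical, hand-maintained table: variant spelling -> preferred spelling.
VARIANTS = {
    "informacao": "informação",
    "informaçao": "informação",
}

# In Python 3, \w matches accented letters in str patterns.
_WORD = re.compile(r"\w+")

def normalise_variants(text, variants=VARIANTS):
    """Replace each listed variant word with its preferred spelling.

    Matching is exact and case-sensitive; unlisted words are left alone.
    """
    def swap(match):
        word = match.group(0)
        return variants.get(word, word)
    return _WORD.sub(swap, text)

print(normalise_variants("A informacao e a informaçao"))
# A informação e a informação
```

Because only listed words are touched, this avoids merging words that merely
look alike, at the cost of having to maintain the list by hand.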

Regards,

Michael