Re: [greenstone-users] unicode character/letter synonyms in mg

From Michael Dewsnip
Date Thu, 02 Oct 2003 17:08:03 +1200
Subject Re: [greenstone-users] unicode character/letter synonyms in mg
In-Reply-To (3F7BB088-65788AB8-cs-waikato-ac-nz)
PS Unicode could help you a bit with the first difficulty I mentioned, but I
think my second point is still valid.
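
To illustrate the Unicode point, here is a minimal sketch in Python. It is not something mg or mgpp does. It assumes the text is already available as Unicode, and the helper names `strip_accents` and `same_barring_accents` are made up for this example. It relies on Unicode canonical decomposition (NFD), which splits an accented letter into a base letter plus combining marks, so no hand-built, encoding-specific table is needed.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks (accents, cedillas, tildes) removed."""
    # NFD splits e.g. "ç" into "c" followed by U+0327 COMBINING CEDILLA.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_barring_accents(a: str, b: str) -> bool:
    """True if the two words are identical once accents are ignored."""
    return strip_accents(a) == strip_accents(b)


if __name__ == "__main__":
    print(strip_accents("ação"))
    print(same_barring_accents("avó", "avô"))
```

The last line also shows why the second point still stands: "avó" (grandmother) and "avô" (grandfather) compare as equal here even though they are different words.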

> Hi Stephen,
> > Hi, I hope this isn't a silly question....
> There's no such thing as a silly question... just silly answers :-)
> > does mg (and mg++) deal with searching for accented characters?
> > I am working with a Portuguese language collection and I have noticed a
> > number of words are spelt with and without accents on particular
> > characters.
> > Sometimes there are even different accents, which differ from those in the
> > Portuguese-language spell-check dictionary that I use (Omnipage Pro 12).
> >
> > I am assuming I am dealing with the same word, which may not always be the
> > case. Many times I feel the unaccented version was used because of the printing
> > technology available to the publisher; sometimes they did not have the
> > right accent available.
> >
> > I am unable to find any literature on the subject (but I may be searching
> > with the wrong terms. hahaha).
> My understanding of your question is that you have two words, both the same,
> except that one has an accent where the other doesn't. You want them to be
> treated the same for searching purposes.
> I don't know too much about this, and feel out of my league here (in other
> words, no doubt I'll get it wrong), but it seems like a near-impossible task
> for a computer program to do in general. And therefore, I doubt that mg or
> mgpp do anything like it. Two words containing different characters will be
> treated as different words (except if something like case-folding is done).
> One difficulty with your problem would be determining that two characters are
> the same (barring an accent). I think you would end up having to create
> hard-wired, encoding-specific tables to show which characters are "the same".
> More importantly, I think it would be a feature that would be unhelpful some
> of the time (or even a lot of the time). It is likely that in many cases two
> words that are similar in this way are actually two different words (as you
> point out), and *shouldn't* be treated the same.
> I'd like to end with something positive, but it seems to me that it would be
> easier to fix all the documents manually rather than try to compensate for the
> problem later on! If only a fairly limited number of words are
> causing problems, you could perhaps do something special at build time, but I
> don't think you could devise a general solution to this problem.
> Regards,
> Michael
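
For the limited-number-of-words case mentioned above, one possible "something special" is a small hand-maintained table applied to the source documents before the collection is built. The sketch below is only an illustration in Python. It does not use any Greenstone, mg or mgpp interface, and the table contents and the names `VARIANTS` and `normalise_variants` are hypothetical.

```python
import re

# Hypothetical, hand-maintained table: variant spelling -> preferred spelling.
# Only words known to be the same word should be listed here.
VARIANTS = {
    "nao": "não",
    "tambem": "também",
    "voce": "você",
}

# In Python 3, \w matches Unicode letters, so "não" is matched as one word.
WORD = re.compile(r"\w+")


def normalise_variants(text: str, variants: dict = VARIANTS) -> str:
    """Replace listed variant spellings; the lookup is case-sensitive."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        # Words not in the table are left exactly as they were.
        return variants.get(word, word)

    return WORD.sub(swap, text)


if __name__ == "__main__":
    print(normalise_variants("ele nao veio tambem"))
```

Because only listed words are touched, this avoids merging pairs that really are different words, at the cost of maintaining the table by hand.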