Re: [greenstone-users] unicode character/letter synonyms in mg

From Stefan Boddie
Date Thu, 2 Oct 2003 20:27:21 +1200
Subject Re: [greenstone-users] unicode character/letter synonyms in mg
In-Reply-To (3F7BB2B3-86DBFA41-cs-waikato-ac-nz)
Hi Stephen,

We've run into issues like this before, notably with Maori, where text is
often written with or without accents on certain characters. Michael is
quite correct in all he says: mg and mg++ treat words containing different
characters as different words, the only exception being case folding.

The easiest way to solve this is to alter the text that is indexed at build
time, replacing all the characters that have accents with unaccented
versions. You'd alter the indexed text only, not the compressed text, so
that the text is still displayed correctly, including accents. You'd also
need to edit the C++ code slightly to do the same replacement within search
terms at run time (in case someone enters text that contains accented
characters). You could even have two indexes, one containing the accented
text and one with accents removed. Selecting the former would do an "exact"
search while the latter would do an accent-free search.
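
A rough sketch of the two-index idea, written in Python for brevity rather
than in the mg/mgpp C++ code. Everything here (the FOLD_TABLE contents and
the fold, build_index and search helpers) is hypothetical and only
illustrates the approach; it says nothing about how mg actually stores its
indexes. The point is that the same folding is applied to the indexed text
and to the query, while the stored documents are left alone.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny folding table (an assumption, not from mg).
FOLD_TABLE = str.maketrans("áàâãéêíóôõúç", "aaaaeeiooouc")


def fold(term):
    """Case-fold a term, then replace accented letters with plain ones."""
    return term.lower().translate(FOLD_TABLE)


def build_index(docs, accent_fold):
    """Map each indexed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            term = fold(word) if accent_fold else word.lower()
            index[term].add(doc_id)
    return index


def search(index, query, accent_fold):
    """Transform the query exactly as the text was transformed at build time."""
    term = fold(query) if accent_fold else query.lower()
    return sorted(index.get(term, set()))


# The documents themselves are never modified, so they display with accents.
docs = {1: "historia do Brasil", 2: "História de Portugal"}
exact_index = build_index(docs, accent_fold=False)
folded_index = build_index(docs, accent_fold=True)
```

With these two indexes, a query for "História" against exact_index matches
only the document spelt with the accent, while the same query against
folded_index matches both spellings.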

Anyway, all that relies on you writing some kind of plugin (or more likely a
specialization of mgbuildproc or mgppbuildproc) that knows how to map
accented characters to their accent-free alternatives. It's easy enough to
do for a single language but difficult to do in a general way.
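
For a single language the mapping can be as simple as a hard-wired table.
The sketch below is Python, not Greenstone code; the ACCENT_TABLES name, the
language codes and the table entries are all illustrative assumptions and
are certainly not complete for either language.

```python
# Hypothetical per-language tables; the entries are illustrative only.
ACCENT_TABLES = {
    # Maori vowels with macrons.
    "mi": str.maketrans("āēīōūĀĒĪŌŪ", "aeiouAEIOU"),
    # A few lower-case Portuguese letters.
    "pt": str.maketrans("áàâãéêíóôõúüç", "aaaaeeiooouuc"),
}


def strip_accents(text, lang):
    """Replace accented characters using the table for one language."""
    table = ACCENT_TABLES.get(lang)
    if table is None:
        raise ValueError(f"no accent table for language {lang!r}")
    return text.translate(table)
```

A table like this has to be written and maintained separately for each
language, which is the part that does not generalize.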

The "right" way to solve the problem is no doubt to implement some general
solution that works in the same way as case folding currently does. I think
Unicode does have some support for doing that kind of thing, but I don't
know much about it.
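
One Unicode mechanism that can be used this way is canonical decomposition
(normalization form NFD), which splits an accented letter into a base letter
followed by combining marks; the marks can then be dropped. A minimal sketch
using Python's standard unicodedata module follows. The strip_marks name is
made up for illustration, and nothing here is part of mg or mgpp.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD, drop nonspacing combining marks, recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))
```

This needs no per-language table, but it only affects letters that Unicode
defines as a base letter plus a mark. A letter such as "ø" has no such
decomposition and passes through unchanged, so it is not a complete answer
either.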

Stefan.

----- Original Message -----
From: "Michael Dewsnip" <mdewsnip@cs.waikato.ac.nz>
To: <Stephen.DeGabrielle@cdu.edu.au>;
<greenstone-users@list.scms.waikato.ac.nz>
Sent: Thursday, October 02, 2003 5:08 PM
Subject: Re: [greenstone-users] unicode character/letter synonyms in mg


> PS Unicode could help you a bit with the first difficulty I mentioned,
> but I think my second point is still valid.
>
>
> > Hi Stephen,
> >
> > > Hi, I hope this isn't a silly question....
> >
> > There's no such thing as a silly question... just silly answers :-)
> >
> > > ...how does mg (and mg++) deal with searching for accented characters?
> >
> > > I am working with a Portuguese language collection and I have
> > > noticed a number of words are spelt with and without accents on
> > > particular characters.
> > > Sometimes there are even different accents, which are different to
> > > the Portuguese language spell-check dictionary that I use.
> > > (Omnipage Pro 12)
> > >
> > > I am assuming I am dealing with the same word, which may not always
> > > be the case. Many times I feel the unaccented version was used due
> > > to the printing technology available to the publisher; sometimes
> > > they did not have the right accent available.
> > >
> > > I am unable to find any literature on the subject (but I may be
> > > searching with the wrong terms. hahaha).
> >
> > My understanding of your question is that you have two words, both
> > the same, except that one has an accent where the other doesn't. You
> > want them to be treated the same for searching purposes.
> >
> > I don't know too much about this, and feel out of my league here (in
> > other words, no doubt I'll get it wrong), but it seems like a
> > near-impossible task for a computer program to do in general. And
> > therefore, I doubt that mg or mgpp do anything like it. Two words
> > containing different characters will be treated as different words
> > (except if something like case-folding is done).
> >
> > One difficulty with your problem would be determining that two
> > characters are the same (barring an accent). I think you would end up
> > having to create hard-wired, encoding-specific tables to show which
> > characters are "the same". More importantly, I think it would be a
> > feature that would be unhelpful some of the time (or even a lot of
> > the time). It is likely that in many cases two words that are similar
> > in this way are actually two different words (as you point out), and
> > *shouldn't* be treated the same.
> >
> > I'd like to end with something positive, but it seems to me that it
> > would be easier to fix all the documents manually rather than try to
> > compensate for the problem later on! If there are a fairly limited
> > number of words that are causing problems you could do something
> > special at build-time perhaps, but I don't think you could do a
> > general solution to this problem.
> >
> > Regards,
> >
> > Michael