Re: [greenstone-users] unicode character/letter synonyms in mg

From Tod Olson
Date Tue, 07 Oct 2003 12:58:26 -0500
Subject Re: [greenstone-users] unicode character/letter synonyms in mg
In-Reply-To (01c201c388bf$00be7a00$0400a8c0-spencer)
>>>>> "SB" == Stefan Boddie <sjboddie@cs.waikato.ac.nz> writes:

SB> We've run into issues like this before, notably with Maori where
SB> text is often written with or without accents on certain
SB> characters. Michael is quite correct in all he says, mg and mg++
SB> treat words containing different characters as different words,
SB> the only exception being for case folding.
[...]

SB> The "right" way to solve the problem is no doubt to implement some
SB> general solution that works in the same way as case folding
SB> currently does. Unicode does have some support for doing that kind
SB> of thing I think but I don't know much about it.

The "right" way is more general, I think. Expose the normalization of
the metadata to the person doing the configuration, possibly at the
level of lexing and parsing.

Implementation would be via an embeddable language, such as Tcl.
Configuration statements would be implemented as Tcl procs. There
would be a statement to let the user define a custom normalization
proc. Effectively, the configuration file would just be written in
Tcl, using calls to your special procs. Similar things can be
done with most languages these days.
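
To make that concrete, here is a minimal sketch of what such a
configuration might look like. It is written in Python rather than Tcl
purely for illustration. The names set_normalizer, normalize, and
fold_accents are hypothetical; nothing like them exists in Greenstone
or mg today.

```python
import unicodedata

# Hypothetical hook: the indexer would expose something like this so the
# configuration file can register a user-defined normalizer.
_normalizer = None

def set_normalizer(func):
    """Register the function applied to each term before indexing."""
    global _normalizer
    _normalizer = func

def normalize(term):
    """Apply the registered normalizer, or pass the term through unchanged."""
    return _normalizer(term) if _normalizer else term

# --- what the person doing the configuration would write ---

def fold_accents(term):
    """Treat accented and unaccented spellings as the same word."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case as well, as mg already does.
    return stripped.casefold()

set_normalizer(fold_accents)
```

With this in place, normalize("Māori") and normalize("MAORI") yield
the same index term, which is the behaviour the original question was
after.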

A default normalizer could be specified, but it could be overridden on
a per-index basis. This is similar to overriding the default
classifier displays on a per-index basis using format strings.
Certainly I'd want to normalize date metadata differently from title
metadata.
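
A per-index override could be sketched roughly as follows, again in
Python for illustration only. The index names, the day/month/year
input format for dates, and the names DEFAULT_NORMALIZER,
INDEX_NORMALIZERS, and normalize_for are all assumptions made up for
this example.

```python
import unicodedata
from datetime import datetime

def fold_text(value):
    """Default normalizer: strip combining accents and fold case."""
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def iso_date(value):
    """Date normalizer: assumes input such as "07/10/2003" (day/month/year)."""
    return datetime.strptime(value, "%d/%m/%Y").strftime("%Y-%m-%d")

# Used for any index that has no normalizer of its own.
DEFAULT_NORMALIZER = fold_text

# Per-index overrides; the index names here are hypothetical.
INDEX_NORMALIZERS = {
    "date": iso_date,
}

def normalize_for(index, value):
    """Pick the index's own normalizer, falling back to the default."""
    return INDEX_NORMALIZERS.get(index, DEFAULT_NORMALIZER)(value)
```

Here a title index falls through to fold_text, while the date index
rewrites its values into a sortable year-month-day form.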

Once you have an embeddable language available in the config file, you
could use it to replace the format strings and maybe use it in the
macro files. As of 2.40, Greenstone maintainers have to maintain two
custom languages: the format strings and the macro language. Replace
those with an embeddable language and you get some real benefits:

1. You are no longer in the language development business.
2. The configurer has a full-featured language to work with.
3. There are already books describing that language.
4. Since the config language is no longer particular to Greenstone,
the person learning it has some hope of reusing it elsewhere.

We have an in-house database that takes this approach, and it has been
quite successful for internal projects. It exposes the lexing and the
parsing of the data to the user. Very powerful abstraction.

Tod A. Olson <tod@uchicago.edu>
Sr. Programmer / Analyst
The University of Chicago Library

"How do you know I'm mad?" said Alice.
"If you weren't mad, you wouldn't have come here," said the Cat.