Re: [greenstone-users] unicode character/letter synonyms in mg

From: Michael Dewsnip
Date: Tue, 14 Oct 2003 14:21:57 +1300
Subject: Re: [greenstone-users] unicode character/letter synonyms in mg
In-Reply-To: (20031007125826I-tao-lib-uchicago-edu)

Hi Tod,

Yes, I agree with your comments. The good news is that Greenstone 3 uses XML,
with XSLT serving as the "embedded language" that you talk about for things
like format statements and macros.

We're hoping that this will be beneficial for the reasons that you mention,
but I hope XSLT isn't going to scare away any users!

Regards,

Michael

Tod Olson wrote:

> >>>>> "SB" == Stefan Boddie <sjboddie@cs.waikato.ac.nz> writes:
>
> SB> We've run into issues like this before, notably with Maori where
> SB> text is often written with or without accents on certain
> SB> characters. Michael is quite correct in all he says, mg and mg++
> SB> treat words containing different characters as different words,
> SB> the only exception being for case folding.
> [...]
>
> SB> The "right" way to solve the problem is no doubt to implement some
> SB> general solution that works in the same way as case folding
> SB> currently does. Unicode does have some support for doing that kind
> SB> of thing I think but I don't know much about it.
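
As a rough illustration of the Unicode support mentioned above: the standard defines a canonical decomposition (NFD) that splits an accented letter into a base letter plus combining marks, so accent folding can work much like case folding. This is only a sketch using Python's standard unicodedata module, not anything that exists in mg or mg++; the function name fold_accents is hypothetical.

```python
import unicodedata


def fold_accents(word: str) -> str:
    """Return a folded form of `word` for index lookups.

    Hypothetical helper: decomposes each character (NFD), drops the
    combining marks (macrons, acute accents, ...), then case-folds.
    """
    decomposed = unicodedata.normalize("NFD", word)
    # unicodedata.combining() is non-zero for combining marks.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.casefold()


if __name__ == "__main__":
    # Both spellings map to the same index term.
    print(fold_accents("Māori") == fold_accents("maori"))
```

Whether folding accents away is appropriate depends on the language, so a real implementation would presumably need to make it optional per collection or per index.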
>
> The "right" way is more general, I think. Expose the normalization of
> the metadata to the person doing the configuration, possibly at the
> level of lexing and parsing.
>
> Implementation would be via an embeddable language, such as Tcl.
> Configuration statements would be implemented as Tcl procs. There
> would be a statement to let the user define a custom normalization
> proc. Effectively, the configuration file would just be written in
> Tcl, using calls to your special procs. Similar things can be
> done with most languages these days.
>
> A default normalizer could be specified, but it could be overridden on
> a per-index basis. This is similar to overriding the default
> classifier displays on a per-index basis using format strings.
> Certainly I'd want to normalize date metadata differently from title
> metadata.
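
A minimal sketch of the default-plus-override idea described above, written in Python rather than Tcl purely for illustration. Every name here (IndexConfig, set_normalizer, default_normalizer, normalize_date) is hypothetical and does not correspond to any Greenstone or mg configuration statement, and the date handling assumes year-first input.

```python
import re
from typing import Callable, Dict

Normalizer = Callable[[str], str]


def default_normalizer(value: str) -> str:
    """Collapse runs of whitespace and case-fold; used when no override is set."""
    return " ".join(value.split()).casefold()


def normalize_date(value: str) -> str:
    """Reduce a year-first date such as '2003-10-14' to the sortable '20031014'.

    Assumption: the first three digit groups are year, month, day.
    Anything else falls back to the default normalizer.
    """
    parts = re.findall(r"\d+", value)
    if len(parts) >= 3 and len(parts[0]) == 4:
        year, month, day = (int(p) for p in parts[:3])
        return f"{year:04d}{month:02d}{day:02d}"
    return default_normalizer(value)


class IndexConfig:
    """Holds one default normalizer plus optional per-index overrides."""

    def __init__(self, default: Normalizer) -> None:
        self._default = default
        self._overrides: Dict[str, Normalizer] = {}

    def set_normalizer(self, index: str, fn: Normalizer) -> None:
        self._overrides[index] = fn

    def normalize(self, index: str, value: str) -> str:
        # Use the override for this index if there is one, else the default.
        return self._overrides.get(index, self._default)(value)


config = IndexConfig(default_normalizer)
config.set_normalizer("date", normalize_date)
```

With this configuration, config.normalize("title", ...) goes through the default normalizer while config.normalize("date", ...) uses the date-specific one, which is the per-index override being proposed.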
>
> Once you have an embeddable language available in the config file, you
> could use it to replace the format strings and maybe use it in the
> macro files. As of 2.40, Greenstone maintainers have to maintain two
> custom languages: the format strings and the macro language. Replace
> those with an embeddable language and you get some real benefits:
>
> 1. You are no longer in the language development business.
> 2. The configurer has a full-featured language to work with.
> 3. There are already books describing that language.
> 4. Since the config language is no longer particular to Greenstone,
> the person learning it has some hope of reusing it elsewhere.
>
> We have an in-house database that takes this approach, and it has been
> quite successful for internal projects. It exposes the lexing and the
> parsing of the data to the user. Very powerful abstraction.
>
> Tod A. Olson <tod@uchicago.edu>
> Sr. Programmer / Analyst
> The University of Chicago Library
>
> "How do you know I'm mad?" said Alice.
> "If you weren't mad, you wouldn't have come here," said the Cat.