Re: [greenstone-devel] Using Greenstone for XML Documents

From Michael Dewsnip
Date Tue, 16 Sep 2003 18:01:57 +1200
Subject Re: [greenstone-devel] Using Greenstone for XML Documents
In-Reply-To (000001c37a46$a8a01a30$6b01a8c0-Thoughtful)
Hi Doug,

I'm afraid the answer to both your questions is a "well, not really...". Let me expand on this a bit, with regard to each question:

1. During the import process, Greenstone converts documents into its internal format: GML (Greenstone Markup Language) - an XML-based, but very simple and static format. If you have installed Greenstone you might like to have a look at one of the "doc.xml" files in the archives folder of the "demo" collection, to see what I mean. The simple structure of these documents means that it could be difficult to map your XML files into this format, and keep the information and structure you need. If your documents are highly structured it is not clear how you would be able to search for data within any particular element, for example. You may be better off looking at a more general XML retrieval system (Lucene is one, I believe).

Our next-generation software, Greenstone 3, changes all this, being completely XML based. In fact, I believe that a TEI demonstration collection has already been built with it. Our first Greenstone 3 release is planned for October 31st.

2. We don't have those facilities at the moment, but we actually have a Master's student currently doing a project on that exact topic! However, the project won't finish for a few months yet.
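To illustrate the concern in point 1, here is a minimal Python sketch of what flattening a structured document into simple sections can look like. It is not Greenstone code and does not reproduce the GML format; the sample document, its element names, and the function `flatten_paragraphs` are all hypothetical. It only shows how nested markup (such as an embedded thesaurus term) disappears once each paragraph is reduced to plain text, which is why searching within a particular element becomes hard afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like input; the element names are illustrative only.
SAMPLE = """<text>
  <div type="chapter">
    <head>Chapter One</head>
    <p>First paragraph with a <term>thesaurus term</term> inside.</p>
    <p>Second paragraph.</p>
  </div>
</text>"""


def flatten_paragraphs(xml_text, para_tag="p"):
    """Return one plain-text string per paragraph element.

    Nested markup such as <term> is collapsed into the surrounding text,
    so the element boundaries are no longer available for searching.
    """
    root = ET.fromstring(xml_text)
    sections = []
    for para in root.iter(para_tag):
        # itertext() yields the text of the element and all its descendants.
        text = "".join(para.itertext())
        # Normalise runs of whitespace to single spaces.
        sections.append(" ".join(text.split()))
    return sections


if __name__ == "__main__":
    for number, section in enumerate(flatten_paragraphs(SAMPLE), start=1):
        print(number, section)
```

Each paragraph becomes one searchable unit, but the `<term>` element that carried the thesaurus indexing is gone from the result.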

Sorry I couldn't be more helpful. I think in three months' time we would have something a lot closer to what you are after, but obviously this doesn't help you much now.

Good luck,


Doug Black wrote:

I've been exploring Greenstone to use with collections of SGML and XML documents. One set is XML TEI and another is a proprietary SGML document type that is generally a book structure. It could reasonably easily be converted to XML. I have two basic needs, for which I am first seeking general answers as to whether Greenstone is a feasible tool.

1. First, will Greenstone handle XML documents with relative ease? I see there is an auxiliary XML plugin, but I haven't been able to understand it yet. As part of this question: can Greenstone search for data within any particular element, and format any particular element? We also need to search at the paragraph level, which generally seems possible by mapping paragraphs to GS <section>s, but do I really have to embed commented-out <section> tags throughout the document? Is there a way of mapping a <p> element in the import documents to GS <section> elements more generically?

2. Each of these collections is indexed with terms from a thesaurus using embedded elements in the XML/SGML. Is searching with a thesaurus plausible with Greenstone?



Doug Black
West Rock Visions
137 Alden Avenue
New Haven, CT 06515
Voice and Fax: (203) 389-0184

_______________________________________________ greenstone-devel mailing list