Re: [greenstone-devel] Using Greenstone for XML Documents

From Katherine Don
Date Tue, 16 Sep 2003 09:40:07 -0600
Subject Re: [greenstone-devel] Using Greenstone for XML Documents
In-Reply-To (3F66A755-55D8C9E0-cs-waikato-ac-nz)
hi Doug

Heather Rolen has used TEI documents with Greenstone. You might want to contact her and ask how she did it.
A link to her project: http://www.nybg.org/bsci/libr/rolen/page1.html

And her email address is hrolen@nybg.org
(Heather, I hope you don't mind me passing on your email address like this)

Stefan Boddie has made a simple TEI plugin in the past - I'm not sure where it is at the moment - you could email him and ask
(sjboddie@cs.waikato.ac.nz)

We have an archive collection of the mailing lists, at www.nzdl.org/cgi-bin/library?a=p&p=about&c=gsarch - Searching for TEI will give you the relevant emails from our correspondence with Heather.

One of Greenstone's search engines, MGPP, is able to search within different elements. Have a look at the user guide (http://www.greenstone.org/manuals/mgpp_user.pdf)

MGPP indexes a restricted kind of XML. It allows two kinds of tags: level tags and metadata tags. Level tags must encompass all the text, and these correspond to Greenstone sections.

eg if Doc is the document level tag, and Sec is a section level tag, then the following is valid
<Doc>
<Sec>blah blah </Sec>
<Sec>here is some text</Sec>
</Doc>

but this is not valid
<Doc>
some text
<Sec>blah blah</Sec>
</Doc>
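
To make the "level tags must encompass all the text" rule concrete, here is a minimal sketch in Python. It is not MGPP's own parser, just an illustration of the rule as described above: it reports any text that sits directly inside the document level tag rather than inside a section level tag. The function name and the default tag names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def text_outside_levels(xml_text, doc_tag="Doc", sec_tag="Sec"):
    """Return text fragments found directly under the document level tag,
    i.e. text that is not enclosed by a section level tag.

    Illustration only: this mirrors the rule described in the email,
    not the checks MGPP itself performs.
    """
    root = ET.fromstring(xml_text)
    if root.tag != doc_tag:
        raise ValueError(f"expected <{doc_tag}> as the root element")

    stray = []
    # Text between <Doc> and its first child element.
    if root.text and root.text.strip():
        stray.append(root.text.strip())
    # Text that follows each child element but is still directly under <Doc>.
    for child in root:
        if child.tail and child.tail.strip():
            stray.append(child.tail.strip())
    return stray


valid = "<Doc>\n<Sec>blah blah </Sec>\n<Sec>here is some text</Sec>\n</Doc>"
invalid = "<Doc>\nsome text\n<Sec>blah blah</Sec>\n</Doc>"

print(text_outside_levels(valid))    # []
print(text_outside_levels(invalid))  # ['some text']
```

An empty list means every piece of text is inside a section; anything else is text that would need to be wrapped in a level tag first.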

any other tags are treated as metadata or field tags, and just require a start and end tag
eg
<Doc>
<Sec><Title>a title</Title>
some text
</Sec>
<Sec><Title>another title</Title>
more text
</Sec>
</Doc>

If this was indexed, you could search at Doc or Sec level, and you can do a fielded search on Title.
You can only retrieve elements that are levels, though. Here, you could retrieve the whole Doc, or an individual Sec, but not the Title on its own.

so, levels can be searched and retrieved, but any other elements can only be searched.
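
The difference between "searchable" and "retrievable" can be sketched with a toy fielded search in Python. This does not use MGPP at all; it only imitates the behaviour described above, where a match inside a Title field returns the enclosing level (Sec or Doc), never the Title by itself. The function name and its parameters are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Doc>
<Sec><Title>a title</Title>
some text
</Sec>
<Sec><Title>another title</Title>
more text
</Sec>
</Doc>"""


def fielded_search(xml_text, field, term, level="Sec"):
    """Return the text of every `level` element that has a `field`
    element containing the word `term`.

    Toy model only: the match happens in the field, but the unit
    returned is always the enclosing level.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for unit in root.iter(level):
        field_words = " ".join(f.text or "" for f in unit.iter(field)).lower().split()
        if term.lower() in field_words:
            # Collapse whitespace so the retrieved section prints on one line.
            hits.append(" ".join("".join(unit.itertext()).split()))
    return hits


print(fielded_search(SAMPLE, "Title", "another"))
# ['another title more text']
print(fielded_search(SAMPLE, "Title", "text"))
# [] because "text" occurs in the section body, not in a Title
```

Passing level="Doc" instead returns the whole document as a single hit, which corresponds to searching at Doc level.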

In the Greenstone archive format, there are metadata elements, and a single content element for each section. Metadata elements can be indexed, and they are also stored in a gdbm database, so they can be retrieved.
I assume that the content element can contain XML tags in it (but I am not sure).

So, back to the point of all this.

You could write a plugin for your document format (in Perl). The plugin can look for e.g. <p> elements and make section breaks there (which you can call paragraphs). For all other elements (non-level elements), you need to decide whether you want them as Greenstone metadata, which can be searched and retrieved, or whether to leave them as elements in the content, where they are searchable but not retrievable except as part of the enclosing section.
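
The mapping decision described above can be sketched as follows. A real Greenstone plugin would be written in Perl against Greenstone's own plugin classes, which are not shown here; this Python sketch only illustrates the logic of turning each <p> into a section and choosing which inner elements to pull out as metadata. The function name, the dictionary layout, and the <term> element are all hypothetical.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_sections(xml_text, section_tag="p", metadata_tags=("term",)):
    """Turn every `section_tag` element into one section record.

    Elements listed in `metadata_tags` are copied out as metadata
    (search and retrieve); everything else stays in the section
    content (searchable only as part of the section).
    Hypothetical structure, not Greenstone's archive format.
    Assumes `section_tag` elements are not nested inside each other.
    """
    root = ET.fromstring(xml_text)
    sections = []
    for para in root.iter(section_tag):
        metadata = {}
        for elem in para.iter():
            if elem.tag in metadata_tags and elem.text:
                metadata.setdefault(elem.tag, []).append(elem.text.strip())
        # Flatten the paragraph to plain text with normalised whitespace.
        content = " ".join("".join(para.itertext()).split())
        sections.append({"metadata": metadata, "content": content})
    return sections


sample = (
    "<book>"
    "<p>Oaks are <term>Quercus</term> trees.</p>"
    "<p>No terms here.</p>"
    "</book>"
)

for section in paragraphs_to_sections(sample):
    print(section)
```

Changing metadata_tags is the point where you make the "metadata or plain content" decision for each element type in your documents.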

Once you have a plugin that can process your documents, it is fairly simple to build a collection. Whether you can format the elements the way you want to, I'm not sure. I suggest you ask Heather about what they did.

I hope this helps,

regards,
Katherine Don



Michael Dewsnip wrote:
Hi Doug,

I'm afraid the answer to both your questions is a "well, not really...". Let me expand on this a bit, in regards to your questions:

1. During the import process, Greenstone converts documents into its internal format: GML (Greenstone Markup Language) - an XML-based, but very simple and static format. If you have installed Greenstone you might like to have a look at one of the "doc.xml" files in the archives folder of the "demo" collection, to see what I mean. The simple structure of these documents means that it could be difficult to map your XML files into this format, and keep the information and structure you need. If your documents are highly structured it is not clear how you would be able to search for data within any particular element, for example. You may be better off looking at a more general XML retrieval system (Lucene is one, I believe).

Our next-generation software, Greenstone 3, changes all this, being completely XML based. In fact, I believe that a TEI demonstration collection has already been built with it. Our first Greenstone 3 release is planned for October 31st.

2. We don't have those facilities at the moment, but we actually have a Masters student currently doing a project on that exact topic! However the project won't finish for a few months yet.

Sorry I couldn't be more helpful. I think in three months' time we will have something a lot closer to what you are after, but obviously this doesn't help you much now.

Good luck,

Michael
 
 

Doug Black wrote:

I've been exploring Greenstone to use with collections of SGML and XML documents. One set is XML TEI and another is a proprietary SGML document type that is generally a book structure. It could reasonably easily be converted to XML. I have two basic needs, for which I am first seeking general answers as to whether Greenstone is a feasible tool.

1. First, will Greenstone handle XML documents with relative ease? I see there is an auxiliary XML plugin but I haven't been able to understand it yet. As part of this question: can Greenstone search for data within any particular element, and format any particular element? We also need to search at the paragraph level, which generally seems possible by mapping paragraphs to GS <section>s, but do I really have to embed commented-out <section> tags throughout the document? Is there a way of mapping a <p> element in the import documents to GS <section> elements more generically?

2. Each of these collections is indexed with terms from a thesaurus using embedded elements in the XML/SGML. Is searching with a thesaurus plausible with Greenstone?

Thanks,

Doug

Doug Black
West Rock Visions
137 Alden Avenue
New Haven, CT 06515
Voice and Fax: (203) 389-0184
doug@westrockvisions.com