| From | Katherine Don |
| Date | Tue, 16 Sep 2003 09:40:07 -0600 |
| Subject | Re: [greenstone-devel] Using Greenstone for XML Documents |
| In-Reply-To | (3F66A755-55D8C9E0-cs-waikato-ac-nz) |
|
Hi Doug,

Heather Rolen has used TEI documents with Greenstone. You might want to contact her and ask how she did it. A link to her project: http://www.nybg.org/bsci/libr/rolen/page1.html
Her email address is hrolen@nybg.org (Heather, I hope you don't mind me passing on your email address like this).

Stefan Boddie made a simple TEI plugin in the past. I'm not sure where it is at the moment, but you could email him and ask (sjboddie@cs.waikato.ac.nz).

We have an archive collection of the mailing lists here: www.nzdl.org/cgi-bin/library?a=p&p=about&c=gsarch
Searching for TEI will give you the relevant emails from our correspondence with Heather.

One of Greenstone's search engines, MGPP, is able to search different elements. Have a look at the user guide (http://www.greenstone.org/manuals/mgpp_user.pdf). It indexes a certain kind of XML, which allows two kinds of tags: level tags and metadata tags.

Level tags must encompass all the text, and these correspond to Greenstone sections. For example, if Doc is the document level tag and Sec is a section level tag, then the following is valid:

    <Doc>
      <Sec>blah blah </Sec>
      <Sec>here is some text</Sec>
    </Doc>

but this is not valid, because "some text" sits outside any Sec:

    <Doc>
      some text
      <Sec>blah blah</Sec>
    </Doc>

Any other tags are treated as metadata or field tags, and just require a start and end tag, e.g.

    <Doc>
      <Sec><Title>a title</Title> some text </Sec>
      <Sec><Title>another title</Title> more text </Sec>
    </Doc>

If this was indexed, you could search at Doc or Sec level, and you could do a fielded search on Title. You can only retrieve elements that are levels, though. Here, you could retrieve the whole Doc or an individual Sec, but not the Title. So levels can be searched and retrieved, but any other elements can only be searched.

In the Greenstone archive format, there are metadata elements and a single content element for each section. Metadata elements can be indexed, but they are also stored in a gdbm database, so they can be retrieved.
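As a rough illustration of the level/metadata split described above, here is a minimal Python sketch. It is not Greenstone or MGPP code (a real Greenstone plugin would be written in Perl), and the function name `split_sections` and the record layout are hypothetical. It assumes Doc and Sec are the level tags, treats every other element inside a Sec as a metadata field, keeps the remaining text as the section's single content value, and rejects input where text sits outside a Sec.

```python
import xml.etree.ElementTree as ET


def split_sections(xml_text, doc_tag="Doc", sec_tag="Sec"):
    """Return one {'metadata': {...}, 'content': str} record per section.

    Assumption: doc_tag and sec_tag are the only level tags. A ValueError
    is raised if any text sits outside a section, mirroring the rule that
    level tags must encompass all the text.
    """
    root = ET.fromstring(xml_text)
    if root.tag != doc_tag:
        raise ValueError(f"expected <{doc_tag}> as the document level tag")
    if root.text and root.text.strip():
        raise ValueError("text found outside a section level tag")

    sections = []
    for sec in root:
        if sec.tag != sec_tag:
            raise ValueError(f"<{sec.tag}> is not inside a <{sec_tag}>")
        if sec.tail and sec.tail.strip():
            raise ValueError("text found outside a section level tag")

        metadata = {}
        parts = [sec.text or ""]
        for field in sec:
            # Any non-level element is treated as a metadata/field tag.
            value = "".join(field.itertext()).strip()
            metadata.setdefault(field.tag, []).append(value)
            # Text following the field's end tag belongs to the content.
            parts.append(field.tail or "")

        # Collapse runs of whitespace in the remaining section text.
        content = " ".join("".join(parts).split())
        sections.append({"metadata": metadata, "content": content})
    return sections


if __name__ == "__main__":
    example = (
        "<Doc> <Sec><Title>a title</Title> some text </Sec> "
        "<Sec><Title>another title</Title> more text </Sec> </Doc>"
    )
    for record in split_sections(example):
        print(record)
```

Pulling the Title text out of the content is one possible design choice; the alternative is to leave such elements in the content, where they stay searchable but are only retrievable as part of the enclosing section.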
I assume that the content element can contain XML tags (but I am not sure).

So, back to the point of all this. You could write a plugin for your document format (in Perl). The plugin can look for, e.g., <p> elements and make section breaks there (which you can call paragraphs). For all other (non-level) elements, you need to decide whether you want them as Greenstone metadata, which can be searched and retrieved, or whether to leave them as elements in the content, where they are searchable but not retrievable except as part of the enclosing section.

Once you have a plugin that can process your documents, it is fairly simple to build a collection. Whether you can format the elements the way you want to, I'm not sure. I suggest you ask Heather about what they did.

I hope this helps,
regards,
Katherine Don

Michael Dewsnip wrote:
Hi Doug, | |