Re: Searching paragraphs

From Stefan Boddie
Date Mon, 30 Sep 2002 11:24:08 +1200
Subject Re: Searching paragraphs
In-Reply-To (B9B7154F-114F%eric-morgan-infomotions-com)
>
> I have a question about searching paragraphs.
>
> I have a set of plain text documents, and I would like to know how I can
> configure Greenstone to allow the end-user to search and/or extract text on
> a paragraph level as well as an entire document level? In other words, when
> the search results are returned how can each returned item be a paragraph
> from the original text?
>

What you need is for Greenstone to split your documents into sections and
then search on a "section:text" index.

The TEXTPlug plugin used to process plain text files can't help you much
here. If you converted your documents to simple HTML files instead, you
could use HTMLPlug's -description_tags option. This would still involve
marking up your documents to specify where each section begins and ends
(see the section titled "Tagging document files" on page 36 of the
developer's guide for more details).

Another option is to write a new plugin yourself. This is quite a simple
case, so you could probably start with TEXTPlug and alter it to create a new
section whenever a paragraph break is encountered.
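
To illustrate the splitting step only, here is a minimal sketch in Python
rather than in the Perl that a real plugin would be written in. It assumes
that a paragraph break means one or more blank lines. The function name
split_paragraphs is made up for this example and is not part of Greenstone.

```python
import re


def split_paragraphs(text):
    """Split plain text into paragraphs at runs of blank lines.

    Hypothetical helper for illustration; not part of Greenstone.
    """
    # Normalise line endings so files from different platforms behave alike.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A break is a newline, optional whitespace (including further blank
    # lines), then another newline.
    chunks = re.split(r"\n\s*\n", text.strip())
    # Drop surrounding whitespace and discard any empty pieces.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    sample = "First paragraph,\nsecond line.\n\n\nSecond paragraph.\n"
    for number, paragraph in enumerate(split_paragraphs(sample), start=1):
        print(number, repr(paragraph))
```

In a modified TEXTPlug each of these chunks would become its own section,
which is what a "section:text" index then searches.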

> Alternatively, I am willing to mark up my plain text documents in some
> flavor of XML. I am currently leaning towards TEILite. I will then be able
> to convert my TEI files to other formats such as PDF and HTML through the
> use of XSL. How can I exploit the functionality of Greenstone to take
> advantage of my structured TEI files? If my documents are in HTML, can I
> configure something to look between the <p></p> tags for content?
> Will/should I write a plugin?
>

Unfortunately HTMLPlug can't be configured to do this. What you want is
really just a simplified version of the way the -description_tags option
works, though, so it would probably only take a minor change to the plugin.
That is, instead of looking for everything between <Section> ... </Section>
you'd need to make it look for <p> ... </p> instead.
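
As a rough sketch of what "looking between the <p> tags" amounts to, the
following Python uses the standard library's html.parser module. It is not
HTMLPlug code, and the names ParagraphExtractor and extract_paragraphs are
invented for this example. It assumes paragraphs are not nested and treats a
new <p> as ending any paragraph that was left unclosed.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means we are outside any <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly closes one that was never closed.
            self._flush()
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Keep text only while inside a paragraph; inline tags such as
        # <b> are skipped but their text is still collected.
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Collapse runs of whitespace, as a browser would.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def close(self):
        super().close()
        self._flush()  # pick up a final paragraph with no closing tag


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Calling extract_paragraphs() on the contents of an HTML file returns one
string per paragraph, each of which would correspond to one section.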

If you want to process TEILite documents you'll need to write a plugin to do
so.

regards,
Stefan.