About this collection

This collection demonstrates Greenstone's ability to build collections from documents supplied in different formats. It contains a number of papers written by various members of the NZDL project in PDF, MSWord, RTF, and PostScript formats.

How the collection works

This collection's configuration file contains the four plugins WordPlug, RTFPlug, PDFPlug and PSPlug (along with the standard three, GAPlug, ArcPlug and RecPlug). These four plugins all extract Title and Source (i.e. filename) metadata.
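
As a rough illustration, the plugin section of such a configuration file might look like the following. This is only a sketch that assumes the usual Greenstone collect.cfg convention of one plugin per line; the ordering shown and the absence of plugin options are assumptions, not a copy of this collection's actual file.

 # Sketch only: ordering and lack of options are assumed
 plugin GAPlug
 plugin WordPlug
 plugin RTFPlug
 plugin PDFPlug
 plugin PSPlug
 plugin ArcPlug
 plugin RecPlug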

Greenstone contains third-party software that is used to convert Word, RTF, PDF and PostScript files into HTML. The Greenstone team does not maintain these modules, although we do include the latest versions with each Greenstone release. Bugs arise with unusual Word documents (e.g. from older Macintosh systems), and sometimes the text is badly extracted. Some PDF files have no machine-readable text at all, comprising instead a sequence of page images from which text can only be extracted by optical character recognition (OCR), which Greenstone does not attempt. If you encounter these problems, there is nothing much you (or we) can do other than omit the rogue documents from the collection, or try to obtain different versions of them.

The configuration file includes a single index, based on document text, and one classifier, an AZList based on Title metadata, shown here (the alphabetic selector is suppressed automatically because the collection contains only a few documents). However, no format statement is specified. In the absence of explicit information, Greenstone supplies sensible defaults. In this case, the default format for the classifier gives:

  • an icon for the HTML version of the document (the text that is actually indexed, essentially the same as the Greenstone Archive format);
  • an icon for the original version of the document (clicking it opens the document in its original form);
  • Title metadata, extracted from the document;
  • Source (i.e. filename) metadata, extracted from the document.

Here is a format statement that achieves exactly the same effect explicitly. It applies to all VLists, and so controls both the search results list and the alphabetic title browser.

 format VList "<td>[link][icon][/link]</td>
               <td>[srclink][srcicon][/srclink]</td>
               <td>[Title]<br><i>([Source])</i></td>"
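
In the configuration file, a format statement like this would sit alongside the index and classifier described earlier. A minimal sketch of those two lines, assuming the usual Greenstone collect.cfg syntax (the exact option spelling is an assumption and is not taken from this collection's file):

 # Sketch only: one index built on document text
 indexes  document:text
 # Sketch only: one AZList classifier based on Title metadata
 classify AZList -metadata Title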
 

How to find information in the MSWord and PDF demonstration collection

There are two ways to find information in this collection:

  • search for particular words that appear in the text by clicking the Search button
  • browse documents by Title by clicking the Titles button