Re: [greenstone-users] Re Greenstone, prefered file types

From Michael Dewsnip
Date Thu, 14 Aug 2003 11:09:29 +1200
Subject Re: [greenstone-users] Re Greenstone, prefered file types
In-Reply-To (005201c36121$ad7ef6e0$0971fea9-MYISP-COM)
Hi Eric,

Great to see New Zealanders considering Greenstone!

For any given document, there are two different ways it can be used in
Greenstone:

1. Text (and perhaps metadata) is extracted from the document, and used for
indexing. When it comes time to display a document, the original document is
used (and so you must have a suitable program for viewing that file type).
The advantages of this are that it is fairly quick and easy, and you see the
original documents. The disadvantages are that you must keep the original
documents around (as well as the full-text index), and you must have separate
programs for viewing the files. I don't think either of these disadvantages
affects you, so I suggest this as the best way to go.

2. The alternative is to convert the document to HTML (or text), and then
index this. The advantages of this are that the original documents are no
longer needed, and no separate programs are needed (a web-browser will do).
The obvious disadvantages are that the conversion process is generally pretty
poor, and the HTML documents may only vaguely resemble the original. From the
description of your situation I think this is probably very undesirable.
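
To make the first option a bit more concrete, here is a minimal Python sketch
of the idea: the index is built from the extracted text, but each search hit
points back at the original file, which is then opened in its own viewer. This
is only an illustration and not how Greenstone builds its indexes; the
function names and the file paths in the usage note are hypothetical.

```python
import re
from collections import defaultdict


def build_index(extracted_text):
    """Map each lowercase word to the set of original files containing it.

    extracted_text: dict of {path of original file: text extracted from it}
    """
    index = defaultdict(set)
    for path, text in extracted_text.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path)
    return index


def search(index, query):
    """Return sorted paths of the original files containing every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Intersect the per-word sets so only files matching all words remain.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

For example, given a dict such as {"cd1/faraday.pdf": "...extracted text..."},
search(build_index(docs), "electricity") returns the paths of the original
files to open, never the extracted text itself.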

Of course, you can also do both, and let the user choose between HTML
versions and the original document. A good example of this is the Word and
PDF demo. I suggest you
have a look at this collection and try browsing the titles - you will see
that you can choose between the HTML version and the original
Word/PDF/PS documents.

If you decide to go with the first option, you will need some way of ripping
out the text from your documents. Unfortunately Greenstone doesn't have
plugins for ".max" or DjVu files (PDF is supported, however). You will need
to look around for programs to convert your max and DjVu files into a
text-based form. Under Unix, you can also resort to the "strings" program
(which may or may not work, depending on the design of the file format); I'm
not sure whether a Windows equivalent is available.
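
If no equivalent turns up, the basic idea behind "strings" is simple enough to
approximate in a few lines of Python, which runs on Windows as well. The sketch
below is a rough stand-in, not the real utility: it assumes the text is stored
as plain, uncompressed ASCII inside the file, which may well not be the case
for max or DjVu files. The function name and the minimum run length of 4 are
my own choices.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters in data.

    data is the raw bytes of a file, e.g. open(path, "rb").read().
    """
    # Printable ASCII runs from space (0x20) to tilde (0x7e).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]
```

To try it, read a file in binary mode and print each item of
extract_strings(open(path, "rb").read()); if the output is mostly noise, the
format probably compresses or encodes its text and a proper converter is
needed.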

Hope this helps you get started.

All the best,


Eric Beanland wrote:

> For several years I have been building a personal collection of historic
> science books and articles in digital format and stored on CD Rs. I have
> used various versions of Scansoft Paperport Deluxe to handle and sort the
> collection and, most recently, Paperport and Abbyy Fine reader for OCR. At
> present the files are in "max" format, or in PDF (image with text under).
> A few are in my preferred format, "DjVu".
> A major issue is the ability to search content on individual CDR disks.
> Greenstone now seems by far the best option and I intend to try it on my
> collection. Before starting though I would like to be better informed on
> file types:
> Can Greenstone be persuaded to "see" *.max files (image of text with OCR
> text included)?
> Which flavour of PDF is preferred?
> Can Greenstone access DjVu files?
> Thank You, Eric