Re: [greenstone-devel] inquiries on document uploading

From Michael Dewsnip
Date Fri, 19 Sep 2003 14:53:43 +1200
Subject Re: [greenstone-devel] inquiries on document uploading
In-Reply-To (3F691BAC-5060001-asti-dost-gov-ph)
Hello Ivy,

Unfortunately, problems like these are largely out of our control.
Greenstone relies on external programs to convert Word and PDF files (and
some others) into plaintext/HTML form for indexing.

- For Word documents, we use wv: http://wvware.sourceforge.net. From the
website, this supports Word documents from Word 2 up to Word 2000 (XP is
not mentioned).

- For PDF documents we use pdftohtml: http://pdftohtml.sourceforge.net.
However, the version included with Greenstone is slightly out-of-date
(version 0.34, the latest is 0.36). At some point we should test the latest
version and include it with Greenstone (maybe for the next release). It is
worth noting that some PDFs just contain images of the pages (no text), and
these are impossible to get any text out of (without doing OCR, that is!).

The converters usually do a reasonable job of extracting the text, but it
sounds like they aren't doing so well with some of your files. It would be
worth importing just the files you are having problems with (make sure
you've got WordPlug and PDFPlug in your collect.cfg file!) and checking the
doc.xml files in the "archives" directory. This will show you the text that
is being pulled out of the files.
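
If there are many files to check, a small script can summarise how much
text ended up in each doc.xml. The following is only a rough sketch in
Python, not part of Greenstone. It assumes the extracted text sits inside
<Content> elements with the HTML escaped, so verify that against one of
your own doc.xml files first. The function names and the 50-character
threshold are made up for illustration.

```python
import html
import re
import sys
from pathlib import Path

# Assumption: extracted text lives in <Content>...</Content> elements.
CONTENT_RE = re.compile(r"<Content>(.*?)</Content>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extracted_text_length(doc_xml_text):
    """Count non-whitespace characters of text inside all <Content> elements."""
    total = 0
    for body in CONTENT_RE.findall(doc_xml_text):
        body = html.unescape(body)   # undo &lt; &gt; &amp; escaping
        body = TAG_RE.sub("", body)  # drop any HTML tags left after unescaping
        total += len("".join(body.split()))
    return total


def report(archives_dir):
    """Map each doc.xml path under archives_dir to its extracted text length."""
    results = {}
    for path in sorted(Path(archives_dir).rglob("doc.xml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[str(path)] = extracted_text_length(text)
    return results


if __name__ == "__main__" and len(sys.argv) == 2:
    for path, count in report(sys.argv[1]).items():
        # Hypothetical threshold: flag documents with almost no text.
        flag = "  <-- little or no text" if count < 50 else ""
        print(f"{count:8d}  {path}{flag}")
```

Run it with the path to the collection's "archives" directory as the only
argument. Documents flagged with little or no text are the ones worth
opening by hand (image-only PDFs will typically show up here).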

Regarding your last question, it sounds like you should use metadata.xml
files to assign your metadata (especially if a lot of it is the same). Have
a look at the metadata.xml file in the demo collection's "import" folder as
an example.
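
For thousands of files it may be easier to generate the metadata.xml file
than to write it by hand. Here is a minimal sketch in Python, assuming the
file uses DirectoryMetadata, FileSet, FileName (a regular expression
matching file names), Description and Metadata elements. Those names are
from memory of the demo collection's file, so compare the output with it
and copy over its header lines (such as the DOCTYPE) if needed. The
function name and the sample values are invented for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def build_metadata_xml(filesets):
    """Build metadata.xml text.

    filesets: list of (filename_regex, {metadata_name: value}) pairs.
    Element names are assumed; check them against the demo collection.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<DirectoryMetadata>"]
    for filename_regex, metadata in filesets:
        lines.append("  <FileSet>")
        lines.append(f"    <FileName>{escape(filename_regex)}</FileName>")
        lines.append("    <Description>")
        for name, value in metadata.items():
            lines.append(
                f"      <Metadata name={quoteattr(name)}>{escape(value)}</Metadata>"
            )
        lines.append("    </Description>")
        lines.append("  </FileSet>")
    lines.append("</DirectoryMetadata>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample values only: one shared block for every PDF, one for a single file.
    print(build_metadata_xml([
        (r".*\.pdf", {"Subject": "Annual reports", "Language": "English"}),
        (r"report2003\.doc", {"Title": "Report for 2003"}),
    ]))
```

Save the output as metadata.xml in the folder containing the files it
describes. Putting the shared values in one FileSet with a broad pattern
keeps the file short when most of the metadata is the same.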

Alternatively, the Greenstone Librarian Interface tool (with Greenstone
v2.40a) provides a graphical interface for assigning metadata that you
might find easier to use.

Hope this helps,

Michael

Ivy Cabeza wrote:

> Good day to all greenstone developers and users.
> Sorry to bother. I would like to inquire if there is a specified
> standard on the files that will be uploaded in the greenstone digital
> library. We tried uploading several PDF and Word files and I was hoping
> that they will work with gsdl's full-text search capability. The PDF
> files were uploaded but are not searchable by its content (full-text
> search). Same goes to some Word files. Does greenstone support Word
> 2000/XP files? What are the requirements to enable effective full-text
> on PDF files?
> Also, what is the easiest way to attach metadata especially if one needs
> to tag thousands of files?
>
> I hope you help me out on this. Thank you very much.