Re: [greenstone-users] Fields pdf and metadata

From John R. McPherson
Date Fri, 08 Oct 2004 11:23:05 +1300
Subject Re: [greenstone-users] Fields pdf and metadata
In-Reply-To (5-0-2-1-2-20041006135608-00b62610-pop-doc-bondy-ird-fr)
On Thu, 2004-10-07 at 01:02, Pier.Luigi.Rossi@bondy.ird.fr wrote:
> Hi,
> I would like to know if it is possible to extract pdf fields (like author,
> subject, keywords)
> as metadata in greenstone and index it ... like the field title.

Hi,
we use a 3rd-party program, called "pdftohtml", to extract text from
.PDF files. Unfortunately it doesn't seem to extract all the document
metadata, but it does extract title, date, and author.

We have an updated PDF plugin that will now use the date and author
metadata from the file, which you can get below. It won't get the
keywords or subject metadata though. Note that inside Greenstone the
author metadata is stored as "Creator" and the date as "Date".

Save this file into the <greenstone dir>\perllib\plugins directory:
http://www.greenstone.org/tmp/PDFPlug.pm

This PDF plugin also needs updated versions of the following 2 files:
http://www.greenstone.org/tmp/HTMLPlug.pm
(saved to perllib\plugins)
and
http://www.greenstone.org/tmp/unicode.pm
(saved to perllib).
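
If you would rather script the steps above than save the files by hand,
something like the following might do it. This is only a rough sketch:
the function names and the gsdl_home argument are made up for
illustration, the URLs are the temporary ones given in this message (so
it only works while they are still being served), and it assumes the
perllib and perllib/plugins directories already exist in your
installation.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Temporary download location quoted in this message.
BASE_URL = "http://www.greenstone.org/tmp/"

# File name -> directory (relative to the Greenstone directory) it is saved to.
DESTINATIONS = {
    "PDFPlug.pm": Path("perllib") / "plugins",
    "HTMLPlug.pm": Path("perllib") / "plugins",
    "unicode.pm": Path("perllib"),
}


def target_path(gsdl_home, filename):
    """Return the full path where `filename` belongs under `gsdl_home`.

    `gsdl_home` is a hypothetical name for your Greenstone directory.
    """
    return Path(gsdl_home) / DESTINATIONS[filename] / filename


def install(gsdl_home):
    """Download the three files and save each one to its destination.

    Existing copies are overwritten. A missing destination directory is
    treated as a sign that `gsdl_home` is wrong, so nothing is created.
    """
    for name in DESTINATIONS:
        dest = target_path(gsdl_home, name)
        if not dest.parent.is_dir():
            raise FileNotFoundError(f"expected directory not found: {dest.parent}")
        urlretrieve(BASE_URL + name, str(dest))
```

Calling install() with the path to your Greenstone directory would fetch
all three files. It overwrites the installed versions, so keep a copy of
the old PDFPlug.pm, HTMLPlug.pm and unicode.pm first in case you need to
go back.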

John McPherson