[greenstone-users] PDF collection with bibliographical metadata

From Birgit Kellner
Date Mon, 20 Dec 2004 23:11:25 +0100
Subject [greenstone-users] PDF collection with bibliographical metadata
Hello,

Searching the list archives, I came across a few similar projects, but
I'm not yet very familiar with Greenstone and therefore could not
quite grasp the suggested solutions.

I have a bibliography database that is linked to PDF files. A file has
the ID of the matching database record at the end of its name; it was
therefore easy for me to bring record and filename together with a
regular expression scan of the directory. Currently there are about 2000
records and matching files.
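
The matching step described above could be sketched in Python roughly as
follows. This is only an illustration: the pattern assumes that the ID is
numeric and follows the last underscore, as in
Kobayashi_Akira_1998_280.pdf, and the function name
pair_files_with_records and the sample record are made up for the sketch.

```python
import re

# Assumption: the record ID is the run of digits between the last
# underscore and the ".pdf" extension, e.g. "..._280.pdf" -> "280".
ID_PATTERN = re.compile(r"_(\d+)\.pdf$", re.IGNORECASE)


def pair_files_with_records(filenames, records):
    """Map each PDF filename to the database record with the same ID.

    `records` is a dict keyed by record ID (as a string). Files that do
    not match the pattern, or whose ID has no record, are skipped.
    """
    pairs = {}
    for name in filenames:
        match = ID_PATTERN.search(name)
        if match and match.group(1) in records:
            pairs[name] = records[match.group(1)]
    return pairs


# Hypothetical sample data standing in for the real database export.
sample_records = {"280": {"Title": "A Study of Something", "Year": "1998"}}
sample_files = ["Kobayashi_Akira_1998_280.pdf", "readme.txt", "Orphan_999.pdf"]
sample_pairs = pair_files_with_records(sample_files, sample_records)
```

In practice the filename list would come from something like
os.listdir() on the import directory.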

The goal is to build a collection by primarily making use of the
bibliographical data, for browsing of titles and authors (keywording
might be added at a later stage), and for field searches using mgpp. I'm
not sure whether the quality of the PDF files and their language - many
of them are in Japanese - allow for text indexing. But at any rate, for
now, working with the bibliographical data for searches is sufficient
and all I currently need.

I thought of structuring metadata.xml like this:
<DirectorySet>
  <FileSet>
    <FileName>Kobayashi_Akira_1998_280.pdf</FileName>
    <Description>
      <Metadata name="Title">A Study of Something</Metadata>
      <Metadata name="Author">Kobayashi, Akira</Metadata>
      <Metadata name="Journal Title">Nani-ka no Kenkyu</Metadata>
      <Metadata name="Volume">25</Metadata>
      <Metadata name="Pages">2-24</Metadata>
      <Metadata name="Year">1998</Metadata>
    </Description>
  </FileSet>
</DirectorySet>
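
With about 2000 records, such a file would have to be generated rather
than typed. A minimal Python sketch of that, using the standard
xml.etree.ElementTree module, might look like this. The function name
build_metadata_xml and the sample data are hypothetical, and the root
element name is a parameter because I am not sure which name the
Greenstone DTD actually expects; the default simply follows the sample
above.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(pairs, root_tag="DirectorySet"):
    """Build one FileSet per file, mirroring the structure shown above.

    `pairs` maps a filename to a dict of {metadata name: value}.
    `root_tag` is an assumption taken from the sample; change it if the
    DTD requires a different root element.
    """
    root = ET.Element(root_tag)
    for filename, fields in pairs.items():
        file_set = ET.SubElement(root, "FileSet")
        ET.SubElement(file_set, "FileName").text = filename
        description = ET.SubElement(file_set, "Description")
        for name, value in fields.items():
            meta = ET.SubElement(description, "Metadata", name=name)
            meta.text = str(value)
    # encoding="unicode" returns a str; special characters such as "&"
    # are escaped by ElementTree.
    return ET.tostring(root, encoding="unicode")


# Hypothetical record for illustration only.
example_xml = build_metadata_xml(
    {"Kobayashi_Akira_1998_280.pdf": {"Title": "A Study of Something",
                                      "Author": "Kobayashi, Akira"}}
)
```

The resulting string would then be written to metadata.xml in the import
directory, in an encoding that can represent any Japanese titles (UTF-8
being the obvious choice).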

My assumption is: if I place all PDF files in the import directory,
together with this metadata.xml containing a FileSet element for each of
them, I can import the collection with RecPlug. Is this correct?

If so, am I correct in assuming that, if I use the use_metadata_files
option, RecPlug will then *not* extract titles from the PDF files?
(This is exactly what I want, because the database has correct titles in
all cases anyway, so the information in metadata.xml should be given
preference.) Would I even need to use PDFPlug in addition if I just
wanted to make use of the metadata, and not build full-text indices?

I'm not sure whether my approach to the metadata fields is correct, or
practical. In the Greenstone DTD (which my XML editor currently reports
as "temporarily moved", by the way, so I can't validate against it), the
"name" attribute doesn't seem to have a fixed set of values. Or are
these restricted somewhere else? Should I use a different DTD? Or write
my own metadata set? Or do I need both a DTD for validating metadata.xml
and a special metadata set so that the Librarian Interface can edit the
data?

And: if I use metadata name attributes to specify the
bibliographical fields, are these also the names I can use to configure
the display, with format statements in collect.cfg, as is done in the
Colt bibliography?

Apologies if these questions are all too newbie-ish, and best regards,

Birgit Kellner