Re: [greenstone-users] PDF collection with bibliographical metadata

From Katherine Don
Date Tue, 21 Dec 2004 13:09:41 +1300
Subject Re: [greenstone-users] PDF collection with bibliographical metadata
In-Reply-To (41C74E0D-2060707-univie-ac-at)
Hi Birgit

Your approach looks fine - put the PDFs in the import directory, and the
bibliographic info into a metadata.xml file. RecPlug needs the
-use_metadata_files option to be set. As long as the filenames in the
metadata.xml file match the files in the import directory, the metadata
will be added.
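
Since the filenames have to match exactly, it may be easiest to generate
metadata.xml from the database export rather than write it by hand. The
following is only a minimal sketch in Python, under several assumptions:
the RECORDS dictionary, the build_metadata function and the ID pattern
are hypothetical stand-ins for your own data, and the root element name
DirectoryMetadata is taken from memory of the Greenstone DTD (the example
in the question uses DirectorySet), so check it against your installation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical records keyed by database ID; in practice these would be
# exported from the bibliography database.
RECORDS = {
    "280": {
        "Title": "A Study of Something",
        "Author": "Kobayashi, Akira",
        "Year": "1998",
    },
}

# Assumes the record ID is the run of digits just before ".pdf".
ID_AT_END = re.compile(r"_(\d+)\.pdf$")


def build_metadata(filenames, records, root_tag="DirectoryMetadata"):
    """Return (root element, list of filenames with no matching record)."""
    root = ET.Element(root_tag)
    unmatched = []
    for name in sorted(filenames):
        match = ID_AT_END.search(name)
        record = records.get(match.group(1)) if match else None
        if record is None:
            unmatched.append(name)
            continue
        fileset = ET.SubElement(root, "FileSet")
        ET.SubElement(fileset, "FileName").text = name
        description = ET.SubElement(fileset, "Description")
        for field, value in record.items():
            ET.SubElement(description, "Metadata", name=field).text = value
    return root, unmatched


root, unmatched = build_metadata(
    ["Kobayashi_Akira_1998_280.pdf", "stray_file.pdf"], RECORDS
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
print("No record for:", unmatched)
```

In real use the filename list would come from something like
os.listdir on the import directory, and the unmatched list shows which
PDFs would end up in the collection without any bibliographic metadata.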

RecPlug by itself is not enough to process your files - you need to have
a plugin that will recognise each document type. You will need to add
either PDFPlug, or UnknownPlug.

PDFPlug will try to convert the PDF file to HTML, and extract metadata
and text. If the conversion doesn't work then the document will not be
added to the collection. The conversion also makes importing quite slow.
The advantage is that if text can be extracted from the PDF, then you get
full text searching on the content. Japanese text will be fine too.

UnknownPlug can be set up to process any kind of document by
specifying the -process_exp option. You would use something like
-process_exp .pdf$

It does nothing with the document itself, just adds it as an associated
file to the Greenstone document which has the metadata.
You get no text searching, but processing is fast, and you can search on
all the metadata.
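
The -process_exp value is a regular expression tested against the
filename, so it can be worth checking what a pattern actually matches
before building. The sketch below uses Python's re module purely as an
illustration, on the assumption that it behaves the same as Perl for a
pattern this simple; the filenames are made up. It compares a pattern
with an unescaped dot against one where the dot is escaped.

```python
import re

# Made-up filenames; only the first is a real PDF name.
names = ["Kobayashi_Akira_1998_280.pdf", "notes_on_xpdf", "readme.txt"]

loose = re.compile(r".pdf$")    # unescaped dot: matches any character
strict = re.compile(r"\.pdf$")  # escaped dot: must be a literal "."

loose_hits = [n for n in names if loose.search(n)]
strict_hits = [n for n in names if strict.search(n)]

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

The unescaped form also accepts a name that merely ends in "pdf" after
some other character, so the escaped form is the tighter choice if the
import directory contains anything besides the PDFs.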

If you don't want full text searching at this stage, you should use
UnknownPlug. Another option is to put both plugins in the list, with
PDFPlug first. Then if PDFPlug can't process the file for some reason,
it can be picked up by UnknownPlug.

Metadata in the XML file should overwrite any metadata extracted from
the PDF. (If you use UnknownPlug there will be none anyway.)
If you want to add multiple values for metadata, you should use the
mode=accumulate attribute, eg

<Metadata name="Author">Kobayashi, Akira</Metadata>
<Metadata name="Author" mode="accumulate">Tanaka, Jun</Metadata>
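
If the metadata.xml file is generated by a script, the same pattern can
be produced automatically. This is a minimal sketch in Python; the
add_values helper is hypothetical, and it simply follows the example
above by leaving the first value plain and marking every later value
with mode="accumulate".

```python
import xml.etree.ElementTree as ET


def add_values(description, field, values):
    """Append one Metadata element per value; accumulate after the first."""
    for index, value in enumerate(values):
        attrs = {"name": field}
        if index > 0:
            # Later values are added to the field instead of replacing it.
            attrs["mode"] = "accumulate"
        ET.SubElement(description, "Metadata", attrs).text = value


desc = ET.Element("Description")
add_values(desc, "Author", ["Kobayashi, Akira", "Tanaka, Jun"])

for element in desc:
    print(ET.tostring(element, encoding="unicode"))
```

Run on the two authors above, this prints one Metadata element per
author, with only the second carrying the mode attribute.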

You can use eg [Author], [Title], [Journal] etc in your format
statements to display any of the metadata you have assigned.

[link][icon][/link] will link to the HTML version of the document that
Greenstone extracted (if you have used PDFPlug), or to a bibliographic
display (if you change the DocumentText format to display metadata).

[srclink][srcicon][/srclink] will link to the PDF.

For more formatting information, see

Re metadata.xml file formats, you can specify any metadata names you
like in there. Using the Librarian Interface, you need to specify
metadata sets, and can only add metadata that is in the set. But if you
are building by hand (command line) you can use any metadata names you like.

I hope this answers your questions.
The best thing to do with something like this is just to try it with a
few documents and see what results you get.

Katherine Don

Birgit Kellner wrote:
> Hello,
> searching the list archives I came across a few similar projects, but
> I'm not yet terribly familiar with Greenstone and could therefore not
> quite grasp the suggested solutions.
> I have a bibliography database that is linked to PDF files. A file has
> the ID of the matching database record at the end of its name; it was
> therefore easy for me to bring record and filename together with a
> regular expression scan of the directory. Currently there are about 2000
> records and matching files.
> The goal is to build a collection by primarily making use of the
> bibliographical data, for browsing of titles and authors (keywording
> might be added at a later stage), and for field searches using mgpp. I'm
> not sure whether the quality of the PDF files and their language - many
> of them are in Japanese - allow for text indexing. But at any rate, for
> now, working with the bibliographical data for searches is sufficient
> and all I currently need.
> I thought of structuring metadata.xml like this:
> <DirectorySet>
>   <FileSet>
>     <FileName>Kobayashi_Akira_1998_280.pdf</FileName>
>     <Description>
>       <Metadata name="Title">A Study of Something</Metadata>
>       <Metadata name="Author">Kobayashi, Akira</Metadata>
>       <Metadata name="Journal Title">Nani-ka no Kenkyu</Metadata>
>       <Metadata name="Volume">25</Metadata>
>       <Metadata name="Pages">2-24</Metadata>
>       <Metadata name="Year">1998</Metadata>
>     </Description>
>   </FileSet>
> </DirectorySet>
> My assumption is: If I place all pdf-files in the import-directory,
> together with this metadata.xml that contains FileSet-tags for all of
> them, I could import the collection with RecPlug. Is this correct?
> If so, am I correct in assuming that, if I use the use_metadata_files
> option, that RecPlug will then *not* extract titles from the PDF-files?
> (Which is exactly what I want because the database has correct titles in
> all cases anyway, so the information in metadata.xml should be given
> preference.) Would I even need to use PDFPlug in addition if I just
> wanted to make use of the metadata, and not build full-text indices?
> I'm not sure whether my approach to the metadata fields is correct, or
> practical. In the Greenstone DTD - which my xml editor currently reports
> as "temporarily moved", by the way, so I can't try and validate -, the
> "name"-attribute doesn't seem to have a fixed set of values - or are
> these restricted somewhere else? Should I use a different DTD? Or write
> my own metadata set? Or do I need both a DTD for validating metadata.xml
> and a special metadata set for the librarian interface to be able to
> edit data?
> And: If I use metadata name attributes for specifying the
> bibliographical fields, are these also the ones I can use to configure
> the display, with format-statements in collect.cfg as it is done in the
> Colt bibliography?
> Apologies if these questions are all too newbie-ish, and best regards,
> Birgit Kellner