[greenstone-users] Re: Query about XMP support in PDFPlug in Greenstone

From John Thompson
Date Wed, 13 Jun 2007 10:47:50 +1200
Subject [greenstone-users] Re: Query about XMP support in PDFPlug in Greenstone
In-Reply-To (506559-71742-qm-web90601-mail-mud-yahoo-com)
Hi Daniel,

The verdict is that PDFPlug does not support XMP metadata. So, I've
written a very basic new plugin, called MetadataXMPPlug, which can be
used to extract the metadata -before- PDFPlug is called.
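
To give a rough idea of what pulling values out of an XMP block involves, here is a minimal Python sketch. It is not the plugin's code (MetadataXMPPlug is a Perl module). The function name extract_xmp_metadata, the prefix table and the field naming are assumptions made for illustration. It also assumes the XMP packet is stored uncompressed in the PDF, so the rdf:RDF element can be found with a plain search. The sample values are borrowed from Example 2 in the quoted message further down.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Short prefixes used when naming extracted fields (illustrative choice).
PREFIXES = {
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://ns.adobe.com/pdf/1.3/": "pdf",
}


def extract_xmp_metadata(data: bytes) -> dict:
    """Return {field: [values]} from the first rdf:RDF block found in data."""
    match = re.search(rb"<rdf:RDF\b.*?</rdf:RDF>", data, re.DOTALL)
    if match is None:
        return {}
    root = ET.fromstring(match.group(0))
    fields = {}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for prop in description:
            if not prop.tag.startswith("{"):
                continue  # skip anything without a namespace
            namespace, _, local = prop.tag[1:].partition("}")
            name = f"{PREFIXES.get(namespace, namespace)}.{local}"
            # rdf:Bag / rdf:Seq / rdf:Alt hold one value per rdf:li
            values = [
                li.text.strip()
                for li in prop.iter(f"{{{RDF_NS}}}li")
                if li.text and li.text.strip()
            ]
            # Otherwise fall back to the element's own text
            if not values and prop.text and prop.text.strip():
                values = [prop.text.strip()]
            if values:
                fields.setdefault(name, []).extend(values)
    return fields


# Hypothetical stand-in for the bytes of a PDF file.
sample = b"""%PDF-1.4 ...binary data...
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator><rdf:Seq>
   <rdf:li>Jantz, Ronald</rdf:li>
   <rdf:li>Giarlo, Michael J.</rdf:li>
  </rdf:Seq></dc:creator>
 </rdf:Description>
 <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
  <pdf:Producer>Acrobat Web Capture 6.0</pdf:Producer>
 </rdf:Description>
</rdf:RDF>
...more binary data..."""

metadata = extract_xmp_metadata(sample)
```

Each rdf:li inside a Bag or Seq becomes its own value, which is what lets multiple creators or subjects survive as separate metadata entries.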

To use:

1. You'll need to be using a recent version of Greenstone - I'm using
GSDL 2.72.

2. Download the new plugin from here:

http://hobbes.dlconsulting.com/~dlc/MetadataXMPPlug.pm

and then copy it into the directory:

<greenstone>/perllib/plugins

3. Modify the collect.cfg file so it reads something like this (note -
order is important. MetadataXMPPlug -must- be before PDFPlug):

...

# MetadataXMPPlug in pipeline to extract PDF metadata
plugin GAPlug
plugin TEXTPlug
plugin MetadataXMPPlug
plugin PDFPlug
plugin MetadataXMLPlug
plugin ArcPlug
plugin RecPlug

# Modified classifiers to make use of new metadata
classify AZList -metadata dc.Title
classify AZList -metadata dc.Creator
classify AZList -metadata dc.Subject
classify AZList -metadata pdf.Keywords

# Modified VList format statement to show off new metadata
format VList "<td>[link][icon][/link][srclink][srcicon][/srclink]</td>
<td><b>[dc.Title]</b><br />
{If}{[dc.Creator], Author(s): <i>[siblings('; '):dc.Creator]</i><br />}
{If}{[dc.Subject], Subject(s): <i>[siblings('; '):dc.Subject]</i><br />}
{If}{[pdf.Keywords], Keyword(s): <i>[siblings('; '):pdf.Keywords]</i><br />}
<small>(DocumentID: [xapMM.DocumentID])</small>
</td>"

...

4. You'll have to rename some of your files, as square brackets (and
parentheses) in filenames cause problems with Perl's regular
expressions. Using [ and ] in Greenstone filenames isn't typically
recommended anyway, because format statements use them.

5. Now when you run import you should see the MetadataXMPPlug at work,
spitting out messages like so:

...
MetadataXMPPlug: processing x.pdf
* extracted 10 pieces of metadata from PDF's XMP block
MetadataXMPPlug: processing y.pdf
* extracted 10 pieces of metadata from PDF's XMP block
MetadataXMPPlug: processing z.pdf
* extracted 19 pieces of metadata from PDF's XMP block
...

6. You might get a bunch of error messages from PDFPlug about "Invalid
Shift-JIS character" but they seem harmless enough.

7. If you view the Greenstone archive files, you'll now see they contain
a lot more metadata, including all of the XMP metadata I could easily
extract from the PDF. Note that there will now be both a dc.Title and a
Title metadata value. I haven't found a way to turn off the automatic
metadata extraction that's part of PDFPlug, so you might just have to
rejig a few format statements to print out the metadata that you want.
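
For the renaming in step 4, something along these lines could save doing it by hand. This is only a sketch. The helper names safe_name and rename_problem_files are hypothetical, and replacing the characters with an underscore is an arbitrary choice. Only file names are touched, not directory names. By default it does a dry run and just reports what it would rename.

```python
import re
from pathlib import Path

# Characters mentioned in step 4: square brackets and parentheses.
PROBLEM_CHARS = re.compile(r"[\[\]()]")


def safe_name(filename: str) -> str:
    """Replace each bracket or parenthesis with an underscore."""
    return PROBLEM_CHARS.sub("_", filename)


def rename_problem_files(import_dir: str, dry_run: bool = True) -> list:
    """Rename affected files under import_dir; return (old, new) path pairs."""
    changes = []
    for path in sorted(Path(import_dir).rglob("*")):
        if not path.is_file():
            continue
        new_name = safe_name(path.name)
        if new_name == path.name:
            continue  # nothing to change
        target = path.with_name(new_name)
        if target.exists():
            # Refuse to overwrite an existing file
            raise FileExistsError(f"{target} already exists")
        changes.append((path, target))
        if not dry_run:
            path.rename(target)
    return changes
```

Call rename_problem_files(import_dir, dry_run=False) on the collection's import directory once the dry-run list looks right, then re-run the import.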

Cheers,
John

DL Consulting
Greenstone Digital Library and Digitisation Specialists
contact@dlconsulting.com
www.dlconsulting.com

daniel biset wrote:
> John,
> Thank you very much for your answer.
> I am sending you the articles.
> Thanks again.
> Best regards
> Daniel Biset
>
> ----- Original Message ----
> From: John Thompson <john@dlconsulting.co.nz>
> To: daniel biset <dbiset@yahoo.com>
> Sent: Wednesday, June 6, 2007 6:53:34 PM
> Subject: Re: Query about PDFPlug in Greenstone
>
> Hi Daniel,
>
> I haven't used the PDFPlug much, but I believe your suspicion is correct
> - that PDFPlug may not be correctly extracting all of the metadata from
> the XMP block.
>
> Basically Greenstone depends upon a third-party piece of software called
> pdftotext (Copyright 1996-2004 Glyph & Cog, LLC) to turn the PDF into
> HTML. During this process metadata is meant to be extracted from the PDF
> and re-expressed as META tags in the top of the HTML. Later Greenstone
> will use the fields you specified in the -metadata_fields arguments to
> associate the appropriate META tag values with the document. As far as
> I can see this step should support multiple instances of each tag, so I
> assume something is going wrong in the conversion stage.
>
> There are also some special cases surrounding Creator and Subject
> metadata, which may explain the strange behaviour with values being
> extracted even when you specify Author.
>
> If you could please send me the example PDFs used in your testing, I can
> track down where the issues are and fix them, or figure out some way to
> work around them.
>
> Cheers,
> John
>
> DL Consulting
> Greenstone Digital Library and Digitisation Specialists
> contact@dlconsulting.com
> www.dlconsulting.com
>
>
>
> daniel biset wrote:
> >
> > Dear John Thompson:
> >
> >
> >
> > We're working on the preservation and digitization of PDF documents,
> > and processing them with Greenstone 2.72.
> >
> > These PDF documents have their metadata embedded in the document
> > itself (seen via the File menu, Document Properties, or the Advanced
> > menu, Document Metadata), such as Title, Author, Keywords, Description.
> >
> > We process them with Greenstone, with PDFPlug configured like so:
> >
> >
> >
> > -metadata_fields
> > Title<dc.Title>,Author<dc.Creator>,Subject<dc.Description>,
> > Keywords<dc.Subject>
> >
> >
> >
> > in order to obtain the metadata as Dublin Core (DC).
> >
> > But when there are several authors or keywords, only the first
> > occurrence is recorded in the Greenstone XML archives or, in other
> > cases, the literal metadata name appears instead (see Example 3).
> >
> > We're investigating the structure of the XMP metadata in each PDF
> > document, and our question is whether Greenstone's PDFPlug takes the
> > metadata from the XMP or from some other part of the PDF document. If
> > it is the former, how can I extract the DC metadata from the XMP?
> > Is it possible?
> >
> >
> >
> > I'm sending you three examples to clarify what I mean:
> >
> >
> >
> > *_Example 1_*
> >
> > XMP with three keywords in an <rdf:Bag>; there is a <dc:creator>
> > element but no <pdf:Author>:
> >
> > ...
> >
> > <rdf:Description rdf:about='uuid:ae16480d-29fe-4feb-948d-9fee0f9523ef'
> >
> > xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
> >
> > <pdf:Producer>Acrobat Distiller 4.05 for Windows</pdf:Producer>
> >
> > <pdf:Keywords>
> >
> > <rdf:Bag>
> >
> > <rdf:li>Dublin Core</rdf:li>
> >
> > <rdf:li>Repositorios institucionales</rdf:li>
> >
> > <rdf:li>Arquitectura de la informacion</rdf:li>
> >
> > </rdf:Bag>
> >
> > </pdf:Keywords>
> >
> > </rdf:Description>
> >
> > ...
> >
> > <dc:creator>
> >
> > <rdf:Seq>
> >
> > <rdf:li>Harry Wagner; Stuart Weibel</rdf:li>
> >
> > </rdf:Seq>
> >
> > </dc:creator>
> >
> > </rdf:Description>
> >
> > ...
> >
> > XML obtained in Greenstone (only the first keyword appears, and the
> > authors are the same as those in dc:creator):
> >
> > <Metadata name="Encoding">utf8</Metadata>
> >
> > <Metadata name="dc.Creator">Harry Wagner; Stuart Weibel</Metadata>
> >
> > <Metadata name="dc.Subject">Dublin Core</Metadata>
> >
> > <Metadata name="Title">DCMI Registry</Metadata>
> >
> > <Metadata name="URL">http://C:/Archivos de ...
> >
> > _*Example 2*_
> >
> > XMP with four keywords separated only by semicolons (not structured
> > with rdf:li as in the previous example), <dc:subject> structured as
> > an rdf:Bag with three entries that differ from the pdf:Keywords, and
> > <dc:creator> structured as an rdf:Seq with two entries:
> >
> > ...
> >
> > <rdf:Description rdf:about='uuid:2fcacb6d-d22c-4505-beb7-83327f55d963'
> >
> > xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
> >
> > <pdf:Producer>Acrobat Web Capture 6.0</pdf:Producer>
> >
> > <pdf:Keywords>Biología molecular; Anatomia; Genetica reproductiva;
> > Tomografía</pdf:Keywords>
> >
> > </rdf:Description>
> >
> > ...
> >
> > <dc:creator>
> >
> > <rdf:Seq>
> >
> > <rdf:li>Jantz, Ronald</rdf:li>
> >
> > <rdf:li>Giarlo, Michael J.</rdf:li>
> >
> > </rdf:Seq>
> >
> > </dc:creator>
> >
> > <dc:subject>
> >
> > <rdf:Bag>
> >
> > <rdf:li>Preservacion digital</rdf:li>
> >
> > <rdf:li>Repositorios institucionales</rdf:li>
> >
> > <rdf:li>Arquitectura de la informacion</rdf:li>
> >
> > </rdf:Bag>
> >
> > </dc:subject>
> >
> > ...
> >
> > XML obtained in Greenstone (only the first entry from dc:creator
> > appears). In the PDFPlug options I wrote "Author<dc.Creator>", so I
> > would like to know which part of the XMP Greenstone extracts the
> > author metadata from:
> >
> >
> >
> > <Metadata name="Encoding">utf8</Metadata>
> >
> > <Metadata name="dc.Creator">Jantz, Ronald</Metadata>
> >
> > <Metadata name="dc.Subject">Biología molecular; Anatomía; Genética reproductiva; Tomografía</Metadata>
> >
> > <Metadata name="dc.Description">Developing preservation
> >
> > With respect to <dc:subject> <rdf:Bag>..., in another test I wrote
> > dc:subject<dc.Subject> in the PDFPlug options in order to extract the
> > entries that are under the <dc:subject> tag in the XMP, without
> > success.
> >
> >
> >
> > *_Example 3_*
> >
> > In this case the XMP is:
> >
> > ...
> >
> > <rdf:Description rdf:about='uuid:8f3aa90a-7a2c-4cdc-a27c-450cac5535e5'
> >
> > xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
> >
> > <pdf:Keywords></pdf:Keywords>
> >
> > <pdf:Producer>PDF CoDe 2.01.20060306 (c) 2002-2006 European
> > Commission</pdf:Producer>
> >
> > </rdf:Description>
> >
> > ...
> >
> > <dc:subject>
> >
> > <rdf:Bag>
> >
> > <rdf:li>Preservación digital</rdf:li>
> >
> > <rdf:li>Metadatos</rdf:li>
> >
> > </rdf:Bag>
> >
> > </dc:subject>
> >
> > ...
> >
> > In the XML, the following appears:
> >
> > <Metadata name="dc.Subject">"keywords"</Metadata>
> >
> >
> >
> > I'll be grateful if you could give me some help.
> >
> > Best regards
> >
> >
> >
> > Daniel Horacio Biset
> >
> > Comisión Nacional de Energia Atomica
> >
> > Buenos Aires
> >
> > Argentina
> >
> >