[greenstone-users] Extracted Greenstone Metadata from Adobe

From Chi-Yu Huang
Date Fri, 14 Oct 2005 14:23:21 +1300
Subject [greenstone-users] Extracted Greenstone Metadata from Adobe
In-Reply-To (s34e189d-045-mail-nelligan-ca)
Hi Devon,

Devon Cinnamon wrote:

>>___________________________________________________________
>>
>>Hey, so Greenstone is working amazing so far getting my Word metadata
>>out of Word and into Greenstone. I haven't put it to a whole big test
>>yet but so far so good.
>>
>>My next problem that I have come across however is that I now need
>>Greenstone to Export the Adobe Acrobat Reader document properties into
>>Greenstone using the same format as how it treats word. This way I can
>>input the information into Adobe properties and input it into Greenstone
>>and Greenstone will grab the information and assign it an ex.whatever
>>tag and then I can do a search on that material in the search page of
>>Greenstone.
>>
I've made a few modifications to PDFPlug to meet the requirements
above. The updated Perl programs can be downloaded from:
http://www.scms.waikato.ac.nz/~chi/Greenstone/ . Please download
/HTMLPlug, PDFPlug/ and copy them into your $GSDLHOME\perllib\plugins
directory, and copy /string.rb/ into $GSDLHOME\perllib. This change
lets you specify the metadata_fields you would like to extract from your
PDF documents (the same way you can for Word documents). You also
need to modify the PDFPlug options in
$GSDLHOME\collect\collect_name\etc\collect.cfg to something like the
following:

plugin PDFPlug -convert_to html -metadata_fields Title,Generator,Date,Author,Subject,Keywords

The metadata_fields option retrieves metadata from the HTML document
that pdftohtml produces from the PDF. It takes a comma-separated list
of metadata fields to attempt to extract. Use 'tag<tagname>' to have the
contents of the first <tagname> pair put in a metadata element called
'tagname'. Capitalise each field name the way you want the metadata
capitalised in Greenstone, since the tag extraction itself is
case-insensitive.
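
To make the behaviour described above concrete, here is a rough Python
sketch of that kind of extraction. It is not the PDFPlug/HTMLPlug code
(those are Perl), only an illustration under stated assumptions: the
function name extract_metadata is hypothetical, the converted HTML is
assumed to carry its properties in <meta name="..." content="...">
elements, and the special form is assumed to be written literally as
'tag<tagname>'.

```python
import re
from html import unescape

META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def extract_metadata(html, metadata_fields):
    """Return {field: value} for each requested field found in html.

    Hypothetical helper, not part of Greenstone. Matching is
    case-insensitive, but each result key keeps the capitalisation
    used in metadata_fields.
    """
    # Collect <meta name="..." content="..."> pairs, keyed by lower-cased name.
    meta = {}
    for tag in META_RE.findall(html):
        attrs = {key.lower(): value for key, value in ATTR_RE.findall(tag)}
        if "name" in attrs and "content" in attrs:
            meta.setdefault(attrs["name"].lower(), unescape(attrs["content"]))

    found = {}
    for field in (f.strip() for f in metadata_fields.split(",")):
        if not field:
            continue
        # Assumed literal syntax: tag<tagname> means "first <tagname> pair".
        tag_form = re.fullmatch(r"tag<(\w+)>", field, re.IGNORECASE)
        if tag_form:
            name = tag_form.group(1)
            pattern = r"<%s\b[^>]*>(.*?)</%s\s*>" % (re.escape(name), re.escape(name))
            match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
            if match:
                found[name] = unescape(match.group(1).strip())
        elif field.lower() in meta:
            found[field] = meta[field.lower()]
    return found


sample = '<html><head><META NAME="author" CONTENT="D. Cinnamon"></head></html>'
print(extract_metadata(sample, "Title,Author,Subject"))
```

Fields that are not present in the document (Title and Subject in the
sample) are simply skipped, which matches the "attempt to extract"
wording above.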

>>I would like it to be the same tag used for both word and
>>adobe so that when I set up a search then I only have to specify one
>>search type for both word and pdf. For example, if I want to do a
>>search on Author and I have put the Author in both Adobe and Word
>>documents then I can just search for author in Greenstone. As of right
>>now Greenstone actually does export the author property into Greenstone
>>but unfortunately the metadata tag is labeled ex.Creator; therefore, I
>>have to specify both an author and a creator in the search menu if I
>>want to search for an author between pdf and word.
>>
Actually, in this case you only need to create a single search index
built on both the /author/ and /creator/ metadata fields. A search on
that index will then look in both fields.
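
As a rough illustration of why one index over two fields is enough,
here is a small Python sketch. It is not how Greenstone's indexer works
and it is not collect.cfg syntax (please check the Greenstone
documentation for the exact form of the indexes line). The names
build_index and search are hypothetical, and ex.Author is an assumed
name for the Word-side field; only ex.Creator is taken from your
message.

```python
from collections import defaultdict


def build_index(documents, fields):
    """Map each lower-cased word found in any of `fields` to the ids
    of the documents that contain it (hypothetical helper)."""
    index = defaultdict(set)
    for doc_id, metadata in documents.items():
        for field in fields:
            for word in metadata.get(field, "").lower().split():
                index[word].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of documents matching `term` in the index."""
    return sorted(index.get(term.lower(), set()))


# One index built over both fields, so a single query covers Word and PDF.
docs = {
    "report.doc": {"ex.Author": "Devon Cinnamon"},
    "brief.pdf": {"ex.Creator": "Devon Cinnamon"},
}
author_index = build_index(docs, ["ex.Author", "ex.Creator"])
print(search(author_index, "Cinnamon"))
```

The person searching never has to know which field a given document
used, because both fields feed the same index.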

>>I have included a document that has an image on the first page of a PDF
>>file that has all the document properties filled out which I would like
>>inserted into Greenstone and on the second page of the document is what
>>Greenstone actually imports.
>>
>>Is there a way currently to export the document properties from the PDF
>>document? If not will this be included in a future release of
>>Greenstone?
>>
>>Thanks again for all your help
>>-------------------------------------------------
>>
>>Devon Cinnamon
>>Systems Support Technician
>>Nelligan O'Brien Payne LLP
>>Suite 1900, 66 Slater
>>Ottawa, Ontario K1P 5H1
>>Tel: (613) 231-8250
>>Fax: (613) 238-2098
>>
>>devon.cinnamon@nelligan.ca
>>www.nelligan.ca
>>
>>
>>
Please let me know how you get on with this!

best wishes,
Chi

>>Confidentiality Note
>>
>>This message is intended only for the use of the individual or entity
>>to which it is addressed, and may contain information that is
>>privileged, confidential and exempt from disclosure under applicable
>>law. If the reader of this message is not the intended recipient, or the
>>employee or agent responsible for delivering the message to the intended
>>recipient, you are hereby notified that any dissemination, distribution
>>or copying of this communication is strictly prohibited. If you have
>>received this communication in error, please notify us immediately by
>>telephone. Thank you.
