Re: [greenstone-users] Extracted Greenstone Metadata from Adobe

From David Robley
Date Fri, 14 Oct 2005 13:39:16 +0930
Subject Re: [greenstone-users] Extracted Greenstone Metadata from Adobe
In-Reply-To (434F0889-9020700-cs-waikato-ac-nz)
On Fri, 14 Oct 2005 10:53, Chi-Yu Huang wrote:
> Hi Devon,
>
> Devon Cinnamon wrote:
> >>___________________________________________________________
> >>
> >>Hey, so Greenstone is working amazing so far getting my Word metadata
> >>out of Word and into Greenstone. I haven't put it to a whole big
> >>test yet but so far so good.
> >>
> >>My next problem that I have come across however is that I now need
> >>Greenstone to Export the Adobe Acrobat Reader document properties into
> >>Greenstone using the same format as how it treats word. This way I can
> >>input the information into Adobe properties and input it into Greenstone
> >>and Greenstone will grab the information and assign it an ex.whatever
> >>tag and then I can do a search on that material in the search page of
> >>Greenstone.
>
> I've made a few modifications to PDFPlug to meet your requirements
> above. The updated Perl code can be downloaded from:
> http://www.scms.waikato.ac.nz/~chi/Greenstone/ . Please download
> /HTMLPlug, PDFPlug/ and copy them into your $GSDLHOME\perllib\plugins
> directory, and /string.rb/ into $GSDLHOME\perllib. This change will
> allow you to specify the metadata_fields you would like to extract from
> your PDF documents (as you already can for Word documents). You
> also need to modify the PDFPlug options in
> $GSDLHOME\collect\collect_name\etc\collect.cfg to something like the
> line below:
>
> plugin PDFPlug -convert_to html -metadata_fields Title,Generator,Date,Author,Subject,Keywords
>
> The metadata_fields option retrieves metadata from the HTML
> document produced by pdftohtml. It takes a comma-separated
> list of metadata fields to attempt to extract. Use
> 'tag<tagname>' to have the contents of the first <tagname> pair put in
> a metadata element called 'tagname'. Capitalise this as you want the
> metadata capitalised in Greenstone, since the tag extraction is case
> insensitive.
>
> >>I would like it to be the same tag used for both word and
> >>adobe so that when I set up a search then I only have to specify one
> >>search type for both word and pdf. For example, if I want to do a
> >>search on Author and I have put the Author in both Adobe and Word
> >>documents then I can just search for author in Greenstone. As of right
> >>now Greenstone actually does export the author property into Greenstone
> >>but unfortunately the metadata tag is labeled ex.Creator; therefore, I
> >>have to specify both an author and a creator in the search menu if I
> >>want to search for an author between pdf and word.
>
> Actually, in this case, you only need to create a single search index
> based on both the /author/ and /creator/ metadata fields. The system
> should then search both metadata fields.
>
> >>I have included a document that has an image on the first page of a PDF
> >>file that has all the document properties filled out which I would like
> >>inserted into Greenstone and on the second page of the document is what
> >>Greenstone actually imports.
> >>
> >>Is there a way currently to export the document properties from the PDF
> >>document? If not will this be included in a future release of
> >>Greenstone?
> >>
> >>Thanks again for all your help
> >>-------------------------------------------------
> >>
> >>Devon Cinnamon
>
>
> Please let me know how you get on with this!
>
> best wishes,
> Chi

I was about to pose the same question; I can say that it works for me now.

My PDF documents have comma-separated lists of both authors and keywords,
and as a result all the keywords, or all the authors, for any particular
document are grouped together in the listing. Is it possible to explode
the comma-separated list to provide separate keyword and author listings?
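
To make clear what I mean by "explode", here is a rough sketch in Python.
It is only an illustration of the splitting, not Greenstone code (the
plugins are Perl), and the function name split_values and its split_and
switch are made up for the example. Splitting on the word "and" is only
sensible for author strings, so it is off by default.

```python
import re

def split_values(raw, split_and=False):
    """Split a comma-separated metadata string into individual values.

    split_and is a hypothetical switch: when True, the word "and"
    between two values (with or without a preceding comma) is also
    treated as a separator.
    """
    if split_and:
        pattern = r"\s*,\s*(?:and\s+)?|\s+and\s+"
    else:
        pattern = r"\s*,\s*"
    parts = re.split(pattern, raw)
    # Drop empty pieces left behind by a trailing or doubled comma.
    return [p.strip() for p in parts if p.strip()]

authors = split_values(
    "Alan Ralph, John Winston Toumbourou, Morgen Grigg, "
    "Rhiannon Mulcahy, Michael Carr-Gregg and Matthew R. Sanders",
    split_and=True,
)
for name in authors:
    print(name)
```

Each resulting value would then need to be stored as its own Author or
Keywords entry for the listing to show them separately.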

I guess the way author names are stored would need to be revised at our
end, as currently we use e.g. "Alan Ralph, John Winston Toumbourou, Morgen
Grigg, Rhiannon Mulcahy, Michael Carr-Gregg and Matthew R. Sanders".
Presumably this would need to be like "Ralph Alan, Toumbourou John
Winston, ..." ??
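
The reordering itself could be scripted rather than retyped. A minimal
Python sketch, assuming the surname is always the last word of the name
(which will be wrong for surnames of more than one word); surname_first
is a name invented for this example:

```python
def surname_first(name):
    """Turn "Given Names Surname" into "Surname Given Names".

    Assumption: the last whitespace-separated word is the surname.
    """
    words = name.split()
    if len(words) < 2:
        # A single word (or empty string) has nothing to reorder.
        return name.strip()
    return " ".join([words[-1]] + words[:-1])

names = ["John Winston Toumbourou", "Alan Ralph", "Matthew R. Sanders"]
# Sorting the reordered names gives a listing ordered by surname.
sortable = sorted(surname_first(n) for n in names)
print(sortable)
```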

Another request that has been made to me is to be able to list all the
documents by the "issue" of the journal that they appear in. The journal
has issues like 'Vol 1 Issue 1', 'Vol 1 Issue 2' etc., with a number of
articles in each; each article is a separate PDF doc. My first thought
would be to put the issue in the Title property of the PDF doc, and the
actual document title in the Subject property; as I understand it I should
then be able to use ex.Title to group by issue number, and still use
ex.Subject to create a title group.
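
The grouping I have in mind looks like this in a Python sketch. The
records are invented sample data, and group_by_issue is a made-up helper;
it only illustrates the idea of keying on ex.Title and is not how
Greenstone builds its classifiers.

```python
from collections import defaultdict

# Invented sample records: ex.Title holds the issue, ex.Subject the
# real article title, as proposed above.
docs = [
    {"ex.Title": "Vol 1 Issue 1", "ex.Subject": "First article"},
    {"ex.Title": "Vol 1 Issue 2", "ex.Subject": "Third article"},
    {"ex.Title": "Vol 1 Issue 1", "ex.Subject": "Second article"},
]

def group_by_issue(records):
    """Map each issue (ex.Title) to the list of its article titles."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["ex.Title"]].append(rec["ex.Subject"])
    return dict(groups)

grouped = group_by_issue(docs)
for issue in sorted(grouped):
    print(issue, grouped[issue])
```

One thing to watch: a plain alphabetical sort puts 'Vol 1 Issue 10'
before 'Vol 1 Issue 2', so zero-padded issue numbers may be needed if
the listing is sorted as text.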


Cheers
--
David Robley

Best diet: Eat as much as you want, but don't swallow it.