Re: [greenstone-users] PDF collection with bibliographical metadata

From: Birgit Kellner
Date: Wed, 22 Dec 2004 13:03:03 +0100
Subject: Re: [greenstone-users] PDF collection with bibliographical metadata
In-Reply-To: (41C769C5-10907-cs-waikato-ac-nz)

Hi Katherine,

many, many thanks for this quick and immensely helpful reply!
I'm working with UnknownPlug for the moment. On the whole, things work
fine, but there seem to be problems with Japanese metadata fields, and I
can't figure out what precisely they are.

A FileSet-element from my metadata.xml looks like this:

<Metadata name="Type" >Journal article</Metadata>
<Metadata name="Author" >Yamada, Ryujo</Metadata>
<Metadata name="Author_Japanese">?? ??</Metadata>
<Metadata name="Title">????????# ???????????????</Metadata>
<Metadata name="Journal">??</Metadata>
<Metadata name="Volume">03.04.04</Metadata>
<Metadata name="Pages">01.01.14</Metadata>
<Metadata name="Year">1936</Metadata>
<Metadata name="Place"></Metadata>
<Metadata name="Publisher"></Metadata>

There are Japanese characters in the fields Author_Japanese, Title and
Journal in this record; I don't know whether they will display correctly
on your end.
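
One way to check whether the Japanese characters are really present in the file (rather than lost on the way, as the "??" in this message suggests) is to parse metadata.xml and list the fields whose values contain non-ASCII characters. The following is a minimal Python sketch, not part of Greenstone. The function name non_ascii_fields, the file name article1.pdf and the Japanese sample value are illustrative placeholders, and the DirectoryMetadata/FileSet/Description layout around the Metadata elements is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record. The Japanese value is an illustrative
# placeholder, not the value from the original metadata.xml.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<DirectoryMetadata>
  <FileSet>
    <FileName>article1.pdf</FileName>
    <Description>
      <Metadata name="Author">Yamada, Ryujo</Metadata>
      <Metadata name="Author_Japanese">山田</Metadata>
      <Metadata name="Year">1936</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
"""


def non_ascii_fields(xml_bytes):
    """Return the names of Metadata elements whose text has non-ASCII characters."""
    root = ET.fromstring(xml_bytes)
    found = []
    for meta in root.iter("Metadata"):
        value = meta.text or ""
        if any(ord(ch) > 127 for ch in value):
            found.append(meta.get("name"))
    return found


if __name__ == "__main__":
    # For a real file, read it in binary mode so the parser can honour
    # the encoding named in the XML declaration.
    print(non_ascii_fields(SAMPLE.encode("utf-8")))
```

If the list comes back empty for a file that should contain Japanese, the characters were probably replaced by literal question marks before the file was saved, since "?" itself is plain ASCII.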

My collect.cfg contains the following code (still preliminary):

public true
beta true

indexes document:ArticleTitle document:Author document:AuthorJapanese
defaultindex document:ArticleTitle

plugin ZIPPlug
plugin UnknownPlug -process_exp '.PDF$'
plugin GAPlug
plugin ArcPlug
plugin RecPlug -use_metadata_files

classify AZList -metadata ArticleTitle -buttonname Title
classify AZList -metadata Author
classify AZList -metadata AuthorJapanese
classify AZList -metadata Journal -buttonname

# format Title list
format CL1VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
valign=top><strong>[ArticleTitle]</strong> - <i>{Or}{[Author]}</i></td>"

# format Author List

format CL2VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
valign=top><strong>[Author]</strong> - <i>[ArticleTitle]</i></td>"

#Author Japanese list
format CL3VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
valign=top><strong>[AuthorJapanese]</strong> - <i>[ArticleTitle]</i></td>"

# Journal list
format CL4VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
valign=top><strong>[Journal]</strong> - [Author]: [ArticleTitle],
[Volume], [Pages] ([Year])</td>"

format "SearchVList" "<p>[srclink][srcicon][/srclink]
{If}{[Author],[Author]. }{If}{[Year],([Year])
}{If}{[Title],[link][Title][/link]. }{If}{[BookTitle],[BookTitle].
}{If}{[Journal],[Journal]. }{If}{[Volume],[Volume] }{If}{[Pages],[Pages].}</p>"

collectionmeta collectionname "aaa"
collectionmeta collectionextra "Test collection."

collectionmeta .document:ArticleTitle "Title"
collectionmeta .document:Author "Author"
collectionmeta .document:Journal "Journal"
collectionmeta .document:AuthorJapanese "Japanese Author"
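
The -process_exp value in the plugin line above is a regular expression that is tested against file names. As a rough illustration, the Python sketch below shows how the pattern '.PDF$' behaves compared with an escaped, case-insensitive variant. Greenstone itself is written in Perl, so this is an analogy only, and it assumes that matching is case-sensitive by default. The helper name matches and the file names are hypothetical.

```python
import re


def matches(process_exp, filename):
    """Return True if the regular expression is found in the file name."""
    return re.search(process_exp, filename) is not None


if __name__ == "__main__":
    # The pattern as written in collect.cfg: unescaped dot, upper case.
    print(matches(r".PDF$", "yamada1936.PDF"))   # upper-case extension
    print(matches(r".PDF$", "yamada1936.pdf"))   # lower-case extension
    print(matches(r".PDF$", "scanPDF"))          # "." matches any character

    # A stricter variant: literal dot, either case.
    print(matches(r"(?i)\.pdf$", "yamada1936.pdf"))
    print(matches(r"(?i)\.pdf$", "scanPDF"))
```

Under these assumptions the original pattern accepts only names ending in upper-case "PDF" preceded by any character, so files with a lower-case ".pdf" extension would be skipped.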


When no Japanese is present in the metadata fields, the various classifier
pages are built fine. When Japanese characters are present in the
classifier's own metadata field (e.g. AuthorJapanese for CL3VList), the
record is omitted from the list. However, when for instance a Japanese
ArticleTitle occurs in the output of the Author list, it is printed
correctly.
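
The sample record above uses the names Author_Japanese and Title, while the collect.cfg excerpt refers to AuthorJapanese and ArticleTitle. Whether that difference exists in the real files or only in these excerpts is not clear from the message, and it may be unrelated to the problem. A quick cross-check can be sketched in Python as follows; the function names are hypothetical, the two strings are abridged from the excerpts above, and only classifier -metadata options and the indexes line are inspected.

```python
import re

# Abridged from the collect.cfg excerpt in this message.
CFG = """indexes document:ArticleTitle document:Author document:AuthorJapanese
classify AZList -metadata ArticleTitle -buttonname Title
classify AZList -metadata Author
classify AZList -metadata AuthorJapanese
classify AZList -metadata Journal
"""

# Abridged from the metadata.xml excerpt; values replaced by "...".
XML = """<Metadata name="Type" >...</Metadata>
<Metadata name="Author" >...</Metadata>
<Metadata name="Author_Japanese">...</Metadata>
<Metadata name="Title">...</Metadata>
<Metadata name="Journal">...</Metadata>
"""


def cfg_metadata_names(cfg_text):
    """Collect metadata names used by classifiers and by the indexes line."""
    names = set(re.findall(r"-metadata\s+(\w+)", cfg_text))
    for line in cfg_text.splitlines():
        if line.startswith("indexes"):
            names.update(re.findall(r"document:(\w+)", line))
    return names


def xml_metadata_names(xml_text):
    """Collect the name attribute of every Metadata element."""
    return set(re.findall(r'<Metadata\s+name="([^"]+)"', xml_text))


def missing_in_xml(cfg_text, xml_text):
    """Names the configuration refers to that the metadata never assigns."""
    return sorted(cfg_metadata_names(cfg_text) - xml_metadata_names(xml_text))


if __name__ == "__main__":
    print(missing_in_xml(CFG, XML))
```

For the excerpts as quoted here, the check reports ArticleTitle and AuthorJapanese as names that the configuration uses but the sample record does not assign.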

The search form doesn't work properly. I get a display with four input
fields, but the fields are not shown (the area below "in field" is
empty). The form buttons do not perform any actions.

For, a number of "wide character in print" errors are shown
for line 356 (or, if I use mgpp, for
line 393), and at line 470. The errors also occur

Do you have any idea what might be wrong? Currently I can't tell whether
it's me or Greenstone causing these errors :-)

Best regards, and thanks again,


Katherine Don wrote:

> Hi Birgit
> Your approach looks fine - put the PDFs in the import directory, and
> the bibliographic info into a metadata.xml file. RecPlug needs the
> -use_metadata_files option to be set. As long as the filenames in the
> metadata.xml file match the files, then the metadata will be added.
> RecPlug by itself is not enough to process your files - you need to
> have a plugin that will recognise each document type. You will need to
> add either PDFPlug, or UnknownPlug.
> PDFPlug will try to convert the PDF file to HTML, and extract metadata
> and text. If the conversion doesn't work then the document will not be
> added to the collection. This also makes importing quite slow. The
> advantage is that if text can be extracted from the pdf, then you get
> full text searching on the content. Japanese text will be fine too.
> UnknownPlug can be set up to process any kinds of documents - by
> specifying the -process_exp option. You would use something like
> -process_exp .pdf$
> It does nothing with the document itself, just adds it as an
> associated file to the Greenstone document which has the metadata.
> You get no text searching, but processing is fast, and you can search
> on all the metadata.
> If you don't want full text searching at this stage, you should use
> UnknownPlug. Another option is to put both plugins in the list, with
> PDFPlug first. Then if PDFPlug can't process the file for some reason,
> it can be picked up by UnknownPlug.
> Metadata in the xml file should overwrite any metadata extracted from
> the PDF. (If you use UnknownPlug there will be none anyway.)
> If you want to add multiple values for metadata, you should use the
> mode=accumulate attribute, eg
> <Metadata name="Author">Kobayashi, Akira</Metadata>
> <Metadata name="Author" mode="accumulate">Tanaka, Jun</Metadata>
> You can use eg [Author], [Title], [Journal] etc in your format
> statements to display any of the metadata you have assigned.
> [link][icon][/link] will link to the Greenstone extracted html version
> of the document (if you have used PDFPlug), or to a bibliographic
> display (if you change the DocumentText format to display metadata).
> [srclink][srcicon][/srclink] will link to the PDF.
> For more formatting information, see
> Re metadata.xml file formats, you can specify any metadata names you
> like in there. Using the Librarian Interface, you need to specify
> metadata sets, and can only add metadata that is in the set. But if
> you are building by hand (command line) you can use any metadata names
> you like.
> I hope this answers your questions.
> The best thing to do with something like this is just to try it with a
> few documents and see what results you get.
> Regards,
> Katherine Don
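
The behaviour of the mode attribute described in the quoted reply can be modelled in a few lines of Python. This is only a sketch of the semantics as the reply describes them, not Greenstone code. It assumes that an entry without mode="accumulate" replaces any existing values, and the function name apply_metadata and the "override" label are illustrative.

```python
def apply_metadata(existing, entries):
    """Apply (name, value, mode) entries to a dict of name -> list of values.

    Assumption: any mode other than "accumulate" replaces existing values.
    """
    result = {name: list(values) for name, values in existing.items()}
    for name, value, mode in entries:
        if mode == "accumulate":
            result.setdefault(name, []).append(value)
        else:
            result[name] = [value]
    return result


if __name__ == "__main__":
    # Metadata hypothetically extracted from the PDF itself.
    extracted = {"Author": ["Name found in PDF"]}
    # The two Author entries from the example in the reply.
    entries = [
        ("Author", "Kobayashi, Akira", "override"),
        ("Author", "Tanaka, Jun", "accumulate"),
    ]
    print(apply_metadata(extracted, entries))
```

In this model the first entry discards the extracted value and the second is added alongside it, which matches the reply's statement that metadata.xml values overwrite extracted ones while accumulate adds further values.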