Re: [greenstone-users] PDF collection with bibliographical metadata

From Katherine Don
Date Mon, 10 Jan 2005 12:00:00 +1300
Subject Re: [greenstone-users] PDF collection with bibliographical metadata
In-Reply-To (41C96277-4000600-univie-ac-at)
Hi Birgit

Classifier problems: If Greenstone thinks a document is in English (it
defaults to English when it can't determine the language), then when
formatting the metadata for sorting it removes any characters that are
not a-z0-9. So Japanese metadata becomes empty, and the document is
therefore left out of the classification.
Try adding the "-default_language ja" option to UnknownPlug. All metadata
will then be assumed to be in Japanese, and no formatting will be done.
This will probably stuff up the Author classification though.
Anyway, try it and see what happens.
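
For example, taking the UnknownPlug line from your collect.cfg, the change
would look something like this (only the -default_language option is new,
the rest is copied from your file):

```
plugin UnknownPlug -process_exp '.PDF$' -default_language ja
```

You will need to rebuild the collection afterwards for the classifiers to
pick up the change.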

Alternatively, you can modify the formatting done for sorting.
In gsdl/perllib/, there is a function called
format_string_english. Comment out the line
$$stringref =~ s/[^a-z0-9 ]//g;
(i.e. put a # symbol at the start of this line).
Hopefully you will get the Japanese entries in there.
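
To illustrate why the Japanese values disappear, here is a rough Python
sketch of the stripping step. It is not Greenstone's actual Perl code: the
function name english_sort_key is made up, the lowercasing and trimming are
assumptions, and the Japanese name is an invented sample. Only the character
class is taken from the line quoted above.

```python
import re

# Character class from the Perl line quoted above: keep a-z, 0-9 and space.
NON_SORT_CHARS = re.compile(r"[^a-z0-9 ]")


def english_sort_key(value: str) -> str:
    """Hypothetical stand-in for the English sort formatting.

    Lowercasing and trimming are assumptions made for this illustration;
    only the regular expression comes from the message above.
    """
    return NON_SORT_CHARS.sub("", value.lower()).strip()


romanised = english_sort_key("Yamada, Ryujo")
japanese = english_sort_key("山田 太郎")  # invented sample name

print(repr(romanised))  # the romanised author keeps a usable sort key
print(repr(japanese))   # every Japanese character is stripped out
```

The romanised author still has a key to sort on, while the Japanese value is
reduced to an empty string, which is why such records drop out of the AZList.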

Another alternative is to use a new classifier developed by Michael for
non-English metadata. It's at
Download and unzip into your gsdl/perllib/classify directory.
Then use GenericList in place of AZList for Japanese metadata. This
should hopefully use Japanese sort order. If you do use this and you
have success/problems, please let us know, as this is still under
development.

SearchForm problems:
I suspect that you haven't changed your collect.cfg file
properly for mgpp. Search forms are only available with mgpp, not
with mg. The document at
gives details about using mgpp. Alternatively, you can use the Librarian
Interface and turn advanced searching on.

If you send me your mgpp collect.cfg file I can take a look.


Birgit Kellner wrote:
> Hi Katherine,
> many, many thanks for this quick and immensely helpful reply!
> I'm working with UnknownPlug for the moment. On the whole, things work
> fine, but there seem to be problems with Japanese metadata fields, and I
> can't figure out what precisely they are.
> A FileSet-element from my metadata.xml looks like this:
> <FileSet>
> <FileName>Yamada_Ryu_1936_1769.PDF</FileName>
> <Description>
> <Metadata name="Type" >Journal article</Metadata>
> <Metadata name="Author" >Yamada, Ryujo</Metadata>
> <Metadata name="Author_Japanese">?? ??</Metadata>
> <Metadata name="Title">????????# ???????????????
> ??</Metadata>
> <Metadata name="Journal">??</Metadata>
> <Metadata name="Volume">03.04.04</Metadata>
> <Metadata name="Pages">01.01.14</Metadata>
> <Metadata name="Year">1936</Metadata>
> <Metadata name="Place"></Metadata>
> <Metadata name="Publisher"></Metadata>
> </Description>
> </FileSet>
> There are Japanese characters in the fields Author_Japanese, Title and
> Journal in this record; I don't know whether they will display correctly
> on your end.
> My collect.cfg contains the following code (still preliminary):
> creator
> maintainer
> public true
> beta true
> indexes document:ArticleTitle document:Author document:AuthorJapanese
> document:Journal
> defaultindex document:ArticleTitle
> plugin ZIPPlug
> plugin UnknownPlug -process_exp '.PDF$'
> plugin GAPlug
> plugin ArcPlug
> plugin RecPlug -use_metadata_files
> classify AZList -metadata ArticleTitle -buttonname Title
> classify AZList -metadata Author
> classify AZList -metadata AuthorJapanese
> classify AZList -metadata Journal -buttonname
> # format Title list
> format CL1VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
> valign=top><strong>[ArticleTitle]</strong> - <i>{Or}{[Author]}</i>
> {If}{[Year],[Year]}</td>"
> # format Author List
> format CL2VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
> valign=top><strong>[Author]</strong> - <i>[ArticleTitle]</i>
> {If}{[Year],[Year]}</td>"
> #Author Japanese list
> format CL3VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
> valign=top><strong>[AuthorJapanese]</strong> - <i>[ArticleTitle]</i>
> {If}{[Year],[Year]}</td>"
> # Journal list
> format CL4VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
> valign=top><strong>[Journal]</strong> - [Author]: [ArticleTitle],
> [Volume], [Pages] ([Year])</td>"
> format "SearchVList" "<p>[srclink][srcicon][/srclink]
> {If}{[Author],[Author]. }{If}{[Year],([Year])
> }{If}{[Title],[link][Title][/link]. }{If}{[BookTitle],[BookTitle].
> }{If}{[Journal],[Journal]. }{If}{[Volume],[Volume] }{If}{[Pages],[Pages].}
> collectionmeta collectionname "aaa"
> collectionmeta collectionextra "Test collection."
> collectionmeta .document:ArticleTitle "Title"
> collectionmeta .document:Author "Author"
> collectionmeta .document:Journal "Journal"
> collectionmeta .document:AuthorJapanese "Japanese Author"
> ---
> When no Japanese is present in metadata fields, the various classifier
> pages are built fine. When Japanese characters are present in the
> classifier's metadata field - e.g. in the case of CL3VList in
> AuthorJapanese -, the record is omitted from the list. However, when for
> instance a Japanese ArticleTitle occurs in the output of the Author
> list, it is printed correctly.
> The search form doesn't work properly. I get a display with four input
> fields, but the fields are not shown (the area below "in field" is
> empty). The form buttons do not perform any actions.
> For, a number of "wide character in print" errors are shown
> for line 356 (or, if I use mgpp, for
> line 393), and at line 470. The errors also occur
> with
> Do you have any idea what might be wrong? Currently I can't tell whether
> it's me or Greenstone causing these errors :-)
> Best regards, and thanks again,
> Birgit