Re: [greenstone-users] PDF collection with bibliographical metadata

From Katherine Don
Date Mon, 10 Jan 2005 12:00:00 +1300
Subject Re: [greenstone-users] PDF collection with bibliographical metadata
In-Reply-To (41C96277-4000600-univie-ac-at)
Hi Birgit

Classifier problems: If Greenstone thinks a document is in English (and
if it can't determine what language it's in, it defaults to English), then
when formatting the metadata for sorting it removes any characters that
are not a-z0-9. So Japanese metadata will become empty, and
therefore the document will not be part of the classification.
Try adding the "-default_language ja" option to UnknownPlug. All metadata
will then be assumed to be in Japanese, and no formatting will be done.
This will probably stuff up the Author classification though.
Anyway, try it and see what happens.
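
To see why the Japanese values drop out, here is a minimal sketch in
Python (not Greenstone's own Perl code) that applies the same kind of
"keep only a-z, 0-9 and space" filter to two values from the sample
record. The function name strip_for_sorting and the lowercasing step
are assumptions made for illustration only.

```python
import re


def strip_for_sorting(value: str) -> str:
    """Drop every character that is not a-z, 0-9 or a space."""
    # Lowercasing first is an assumption, so that "Yamada" keeps its "y".
    return re.sub(r"[^a-z0-9 ]", "", value.lower())


# Values taken from the sample metadata.xml record in this thread.
records = {
    "Author": "Yamada, Ryujo",
    "Author_Japanese": "山田 龍城",
}

for name, value in records.items():
    key = strip_for_sorting(value)
    # An empty (or whitespace-only) key leaves nothing to classify on.
    status = "classified" if key.strip() else "dropped (empty sort key)"
    print(f"{name}: {key!r} -> {status}")
```

The romanised author keeps a usable key, while the Japanese value is
reduced to whitespace only, which matches the record disappearing from
the AuthorJapanese list.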

Alternatively you can modify the formatting done for sorting.
In gsdl/perllib/sorttools.pm, there is a function called
format_string_english. Comment out the line
$$stringref =~ s/[^a-z0-9 ]//g;
(i.e. put a # symbol at the start of this line).
Hopefully you will then get the Japanese entries in there.
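
As a rough illustration of what skipping that substitution changes,
the Python sketch below (again not Greenstone code) builds sort keys
without the character filter. The helper name sort_key_without_strip is
hypothetical, and the ordering shown is plain Unicode code point order,
which is an assumption and not a proper Japanese collation.

```python
def sort_key_without_strip(value: str) -> str:
    """Hypothetical sort key that keeps non-ASCII characters intact."""
    # Only lowercase and trim; nothing is removed from the value.
    return value.lower().strip()


# Values taken from the sample metadata.xml record in this thread.
values = ["山田 龍城", "Yamada, Ryujo", "文化", "中観思想の密教化"]

# Keep every value whose key is non-empty, then sort by that key.
kept = sorted(
    (v for v in values if sort_key_without_strip(v)),
    key=sort_key_without_strip,
)

for v in kept:
    print(v)
```

All four values survive, so the Japanese entries would appear in the
list, but they are ordered by code point and not by reading.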

Another alternative is to use a new classifier developed by Michael for
non-English metadata. It's at
http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/GenericList.pm.zip
Download and unzip into your gsdl/perllib/classify directory.
Then use GenericList in place of AZList for Japanese metadata. This
should hopefully use Japanese sort order. If you do use this and you
have success/problems, please let us know as this is still under
development.

SearchForm problems:
I suspect that perhaps you haven't changed your collect.cfg file
properly when using mgpp? Search forms are only available with mgpp, not
with mg. The document at http://www.greenstone.org/docs/mgpp_user.pdf
gives details about using mgpp; alternatively you can use the Librarian
Interface and turn advanced searching on.

If you send me your mgpp collect.cfg file I can take a look.

Cheers,
Katherine

Birgit Kellner wrote:
> Hi Katherine,
>
> many, many thanks for this quick and immensely helpful reply!
> I'm working with UnknownPlug for the moment. On the whole, things work
> fine, but there seem to be problems with Japanese metadata fields, and I
> can't figure out what precisely they are.
>
> A FileSet-element from my metadata.xml looks like this:
>
> <FileSet>
> <FileName>Yamada_Ryu_1936_1769.PDF</FileName>
> <Description>
> <Metadata name="Type" >Journal article</Metadata>
> <Metadata name="Author" >Yamada, Ryujo</Metadata>
> <Metadata name="Author_Japanese">山田 龍城</Metadata>
> <Metadata name="Title">中観思想の密教化# ―特に聖提婆の心障清浄論に就い
> て―</Metadata>
> <Metadata name="Journal">文化</Metadata>
> <Metadata name="Volume">03.04.04</Metadata>
> <Metadata name="Pages">01.01.14</Metadata>
> <Metadata name="Year">1936</Metadata>
> <Metadata name="Place"></Metadata>
> <Metadata name="Publisher"></Metadata>
> </Description>
> </FileSet>
>
> There are Japanese characters in the fields Author_Japanese, Title and
> Journal in this record; I don't know whether they will display correctly
> on your end.
>
> My collect.cfg contains the following code (still preliminary):
>
> creator Birgit.Kellner@univie.ac.at
> maintainer Birgit.Kellner@univie.ac.at
> public true
> beta true
>
> indexes document:ArticleTitle document:Author document:AuthorJapanese
> document:Journal
> defaultindex document:ArticleTitle
>
> plugin ZIPPlug
> plugin UnknownPlug -process_exp '.PDF$'
> plugin GAPlug
> plugin ArcPlug
> plugin RecPlug -use_metadata_files
>
> classify AZList -metadata ArticleTitle -buttonname Title
> classify AZList -metadata Author
> classify AZList -metadata AuthorJapanese
> classify AZList -metadata Journal -buttonname
>
> # format Title list
> format CL1VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
> valign=top><strong>[ArticleTitle]</strong> - <i>{Or}{[Author]}</i>
> {If}{[Year],[Year]}</td>"
>
> # format Author List
>
> format CL2VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
> valign=top><strong>[Author]</strong> - <i>[ArticleTitle]</i>
> {If}{[Year],[Year]}</td>"
>
> #Author Japanese list
> format CL3VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
> valign=top><strong>[AuthorJapanese]</strong> - <i>[ArticleTitle]</i>
> {If}{[Year],[Year]}</td>"
>
> # Journal list
> format CL4VList "<td valign=top>[srclink][srcicon][/srclink]</td><td
> valign=top><strong>[Journal]</strong> - [Author]: [ArticleTitle],
> [Volume], [Pages] ([Year])</td>"
>
> format "SearchVList" "<p>[srclink][srcicon][/srclink]
> {If}{[Author],[Author]. }{If}{[Year],([Year])
> }{If}{[Title],[link][Title][/link]. }{If}{[BookTitle],[BookTitle].
> }{If}{[Journal],[Journal]. }{If}{[Volume],[Volume] }{If}{[Pages],[Pages].}
>
> collectionmeta collectionname "aaa"
> collectionmeta collectionextra "Test collection."
>
> collectionmeta .document:ArticleTitle "Title"
> collectionmeta .document:Author "Author"
> collectionmeta .document:Journal "Journal"
> collectionmeta .document:AuthorJapanese "Japanese Author"
>
> ---
>
> When no Japanese is present in metadata fields, the various classifier
> pages are built fine. When Japanese characters are present in the
> classifier's metadata field - e.g. in the case of CL3VList in
> AuthorJapanese -, the record is omitted from the list. However, when for
> instance a Japanese ArticleTitle occurs in the output of the Author
> list, it is printed correctly.
>
> The search form doesn't work properly. I get a display with four input
> fields, but the fields are not shown (the area below "in field" is
> empty). The form buttons do not perform any actions.
>
> For import.pl, a number of "wide character in print" errors are shown
> for mgbuildproc.pm line 356 (or, if I use mgpp, for mgppbuildproc.pm
> line 393), and doc.pm at line 470. The mgbuildproc.pm errors also occur
> with buildcol.pl.
>
> Do you have any idea what might be wrong? Currently I can't tell whether
> it's me or Greenstone causing these errors :-)
>
> Best regards, and thanks again,
>
> Birgit
>
>
>