Re: [greenstone-devel] Adding Metadata not already built into greenstone

From jmt12@cs.waikato.ac.nz
Date Mon, 14 Jul 2003 15:43:47 +1200 (NZST)
Subject Re: [greenstone-devel] Adding Metadata not already built into greenstone
In-Reply-To (71E37EF6B7DCC1499CEA0316A256832801056F31-loki-globexplorer-com)
Hello Greg,

No need to apologize. We're always happy to respond to postings to the
list, especially when there are difficulties like you've experienced.

Greenstone usually looks for metadata in additional files (metadata.xml),
but it can be made to extract metadata from HTML using the HTMLPlug.

There are two things you need for Greenstone to be able to extract
metadata from source HTML files directly. First, the metadata has to be
in META tags in the HEAD of the document. Second, the -metadata_fields
argument has to be given to the HTMLPlug, naming the metadata you want
extracted. Please see the examples of this below.

However, I can appreciate that hand editing the HTML files is too time
consuming. I've had plenty of experience moulding HTML into something the
Librarian Interface can understand, and I suggest automating the HTML
manipulation (using a little C/Java/Perl program, or perhaps XSLT
transforms if you're into XML). That shouldn't be too hard, depending on
how many metadata items you want to extract and how complex they are. If
you're stuck you could send me more details about what metadata you want
to extract (perhaps some example files), and I could probably code up a
little Java program to do it.
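
A minimal sketch of such a conversion, written in Python purely for
illustration. It assumes the metadata always appears in the commented
<Description>/<Metadata name="..."> form shown in the sample file, and it
copies each entry into a META tag in the HEAD (creating a HEAD before
<BODY> if the file has none). The function names are hypothetical and are
not part of Greenstone.

```python
import re
from html import escape

# Commented-out block as in the sample file:
# <!-- <Description> <Metadata name="X"> value </Metadata> </Description> -->
DESCRIPTION_RE = re.compile(
    r"<!--\s*<Description>(.*?)</Description>\s*-->",
    re.IGNORECASE | re.DOTALL,
)
METADATA_RE = re.compile(
    r'<Metadata\s+name="([^"]+)"\s*>(.*?)</Metadata>',
    re.IGNORECASE | re.DOTALL,
)
HEAD_CLOSE_RE = re.compile(r"</HEAD\s*>", re.IGNORECASE)
BODY_RE = re.compile(r"<BODY\b", re.IGNORECASE)


def extract_metadata(html):
    """Return (name, value) pairs found inside commented Description blocks."""
    pairs = []
    for block in DESCRIPTION_RE.findall(html):
        for name, value in METADATA_RE.findall(block):
            # Collapse internal whitespace and trim the value.
            pairs.append((name.strip(), " ".join(value.split())))
    return pairs


def add_meta_tags(html):
    """Return html with one META tag per extracted pair placed in the HEAD."""
    pairs = extract_metadata(html)
    if not pairs:
        return html  # nothing to convert, leave the file untouched
    tags = "".join(
        '<META name="%s" content="%s" />\n'
        % (escape(name, quote=True), escape(value, quote=True))
        for name, value in pairs
    )
    head_close = HEAD_CLOSE_RE.search(html)
    if head_close:
        # A HEAD already exists: insert just before its closing tag.
        i = head_close.start()
        return html[:i] + tags + html[i:]
    body = BODY_RE.search(html)
    if body:
        # No HEAD: create one immediately before <BODY>.
        i = body.start()
        return html[:i] + "<HEAD>\n" + tags + "</HEAD>\n" + html[i:]
    # Neither HEAD nor BODY found: prepend a HEAD to the file.
    return "<HEAD>\n" + tags + "</HEAD>\n" + html
```

Applied to the text of each source file before import, this leaves the
original comment block in place and only adds the META tags. Running it
twice on the same file would add the tags twice, so it is best run on a
fresh copy of the source documents.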

Good Luck,
John Thompson

=================Fixed HTML Sample===================

<HTML>
<HEAD>
<META name="Subject" content="African-American" />
</HEAD>
<BODY>

<TITLE>WWII In-migration & Rising Bigotry</TITLE><b><H1>WWII In-migration
& Rising
Bigotry</H1> </b>

<!-- <Description>
<Metadata name="Subject"> African-American </Metadata>
</Description> -->

...

=================Collect.cfg for metadata extraction===================

creator john@home
maintainer john@home
public true

indexes document:text document:Title document:Source
defaultindex document:text

plugin ZIPPlug
plugin GAPlug
plugin TEXTPlug
plugin HTMLPlug -metadata_fields Subject
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
plugin WordPlug
plugin PSPlug
plugin ArcPlug
plugin RecPlug

classify AZList -metadata Subject
classify AZList -metadata Title
classify AZList -metadata Source

collectionmeta collectionname "Extraction Test"
collectionmeta iconcollection ""
collectionmeta collectionextra "Testing the extraction of metadata from html files"
collectionmeta .document:text "text"
collectionmeta .document:Title "titles"
collectionmeta .document:Source "filenames"

> I apologize for taking up the list's time, but I have spent many hours
> going nowhere with this tool.
>
> I have several thousand HTML documents that I wish to import; they have
> various metadata tags associated with them to show some associated
> information (theme, some information relating to dates the source
> material is about, etc.).
>
> I have tried adding various combinations of lines in the collection file
> but to no avail. I have read the various papers available on the
> Greenstone site and they do nothing to help. I have bought and read a
> book about digital libraries and it also leaves me stumped.
>
> Could someone PLEASE provide the minimum bits I need ? I need to know
> exactly what to put in a configuration file to have it see and list
> these new metadata tags. Everything I do it ignores.
>
> If I have to handcode every document and create entries for every single
> metadata item I think this tool will be left in the wastebin of history.
> I am beyond frustrated with this tool -- I spent an equivalent amount of
> time -- 50 hours or so -- to create an interactive map with various
> layers and interactivity using an opensource tool. With Greenstone my
> first 50 hours of work equals a big pile of nothing.
>
> With fading hopes,
>
> Greg Williamson
> gsw@globexplorer.com
>
> =================Sample HTML snippet from a source file showing a bit of
> one "Subject" I want to code:
>
> <BODY>
>
> <TITLE>WWII In-migration & Rising Bigotry</TITLE><b><H1>WWII
> In-migration & Rising Bigotry</H1> </b>
>
> <!--
> <Description>
>
> <Metadata name="Subject"> African-American </Metadata>
>
> </Description>
> -->
> ...
>
> Sample snippet from a configuration file:
>
> ...
> indexes document:text document:Title document:Source document:Subject
> defaultindex document:text
>
> plugin ZIPPlug
> plugin GAPlug
> plugin TEXTPlug
> plugin HTMLPlug-description_tags
> plugin EMAILPlug
> plugin PDFPlug
> plugin RTFPlug
> plugin WordPlug
> plugin PSPlug
> plugin ArcPlug
> plugin RecPlug
>
>
> classify AZList -metadata Title
> classify AZList -metadata Source
> classify AZList -metadata Subject -buttonname Subject
> ...