Re: [greenstone-devel] Adding Metadata not already built into greenstone

From Michael Dewsnip
Date Mon, 14 Jul 2003 16:49:35 +1200
Subject Re: [greenstone-devel] Adding Metadata not already built into greenstone
In-Reply-To (33088-130-217-244-2-1058154227-squirrel-webmail-scms-waikato-ac-nz)
Hi Greg,

Just to clarify and expand some of the points John made:

There are two ways to assign metadata to a collection. The first is to write
a metadata.xml file (see collect/demo/import/metadata.xml for an example).
The second is to assign the metadata within the documents (as you have done).

Writing a metadata.xml file has the advantage of keeping all the metadata
together, and being independent of the documents themselves. The disadvantage
is that generating new collections from existing documents is more difficult.
Since you already have all the metadata within the documents, I won't discuss
this option further (see the Greenstone Developer's Guide for more
information). Also, as John pointed out, it would not be too difficult to
write a small program to extract the metadata from your documents and write a
metadata.xml file, if necessary.
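
A minimal sketch of such an extraction program follows, in Python (an
assumption; the thread itself only mentions C, Java and Perl). It pulls the
<Metadata name="...">...</Metadata> pairs out of each document and writes one
<FileSet> per file. The function names are hypothetical, and the
DirectoryMetadata/FileSet/FileName layout, including FileName being treated
as a regular expression, is from memory, so compare the output with
collect/demo/import/metadata.xml (which may also carry a DOCTYPE line that is
omitted here) before relying on it.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches <Metadata name="...">value</Metadata> as written in the sample HTML.
METADATA_RE = re.compile(
    r'<Metadata\s+name="([^"]+)"\s*>(.*?)</Metadata>',
    re.IGNORECASE | re.DOTALL,
)


def extract_metadata(html):
    """Return the (name, value) pairs found in one document's text."""
    return [(name, value.strip()) for name, value in METADATA_RE.findall(html)]


def build_metadata_xml(files):
    """Build metadata.xml text from {filename: [(name, value), ...]}."""
    lines = ['<?xml version="1.0" ?>', '<DirectoryMetadata>']
    for filename in sorted(files):
        lines.append('  <FileSet>')
        # Assumption: FileName is a regular expression, so the name is escaped.
        lines.append('    <FileName>%s</FileName>' % escape(re.escape(filename)))
        lines.append('    <Description>')
        for name, value in files[filename]:
            lines.append('      <Metadata name=%s>%s</Metadata>'
                         % (quoteattr(name), escape(value)))
        lines.append('    </Description>')
        lines.append('  </FileSet>')
    lines.append('</DirectoryMetadata>')
    return '\n'.join(lines) + '\n'
```

To use it, read each file in the import directory, pass its text to
extract_metadata(), collect the results in a dict keyed by filename, and
write the string returned by build_metadata_xml() to import/metadata.xml.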

Assigning metadata within the documents is further split into two types:
document-level metadata, and section-level metadata. It is not clear from
your example whether you wish to assign document-level or section-level
metadata, so I'll describe both here.

Document-level metadata is assigned using the <Meta> tags at the top of the
files (and specifying the '-metadata_fields' option to HTMLPlug), as John
described.

Section-level metadata, on the other hand, uses the
<Section><Description><Metadata
name="...">...</Metadata></Description></Section> structure, and the
'-description_tags' option, which is very close to what you have.
Unfortunately, it seems that the <Section> tags are required - you cannot
just leave out the <Section> tags to assign metadata to the whole document.
However, if you *do* want to assign document-level metadata using this
method, you could put the entire document in one <Section>...</Section> tag.

In terms of your collection, I think the best solution to your problem is to
add <Section> tags around your <Description> elements. Obviously with a few
thousand documents you will need to automate this - either John or I can
probably help you with this if necessary.
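
As a starting point, here is a minimal sketch of that automation in Python
(again an assumption; any scripting language would do). It opens a <Section>
just before the first <Description> and closes it in a separate comment just
before </BODY>, so the whole document becomes one section. The function name
is hypothetical, and the placement of the closing </Section> in its own
comment follows my recollection of the layout in the Developer's Guide, so
check it against section 2.1 and try it on a few files first.

```python
import re

OPEN_DESCRIPTION_RE = re.compile(r'<Description>', re.IGNORECASE)
CLOSE_BODY_RE = re.compile(r'</BODY>', re.IGNORECASE)
# Assumption: the closing tag lives in its own HTML comment.
CLOSING_SECTION = '<!--\n</Section>\n-->\n'


def wrap_in_section(html):
    """Wrap the whole document in one <Section>, if it has a <Description>."""
    if re.search(r'<Section>', html, re.IGNORECASE):
        return html  # already converted, leave untouched
    if not OPEN_DESCRIPTION_RE.search(html):
        return html  # no metadata block, nothing to wrap
    # Open the section immediately before the first <Description>.
    html = OPEN_DESCRIPTION_RE.sub(
        lambda m: '<Section>\n' + m.group(0), html, count=1)
    # Close the section just before </BODY>, or at the end of the file.
    match = CLOSE_BODY_RE.search(html)
    if match:
        return html[:match.start()] + CLOSING_SECTION + html[match.start():]
    return html.rstrip('\n') + '\n' + CLOSING_SECTION
```

Running wrap_in_section() over every file in the import directory and
writing the result back (after taking a backup) would convert the
collection; running it twice is harmless because converted files are
skipped.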

For more information on this, you might like to have a look at the Greenstone
Developer's Guide (if you haven't already), especially section 2.1. The
Greenstone manuals are available from our homepage
http://www.greenstone.org/english/docs.html

Hope this helps you have a little more success with Greenstone.

Regards,

Michael Dewsnip

jmt12@cs.waikato.ac.nz wrote:

> Hello Greg,
>
> No need to apologize. We're always happy to respond to postings to the
> list, especially when there are difficulties like you've experienced.
>
> Greenstone usually depends on additional files (metadata.xml) in which
> to find metadata, but it can be made to extract it from html using the
> HTMLPlug.
>
> There are two things you need for Greenstone to be able to extract
> metadata from source html files directly. Firstly the metadata has to be
> in Meta tags in the Head of the document, and secondly the
> -metadata_fields argument has to be provided to the HTMLPlug explaining
> what metadata you want extracted. Please see the examples of this below.
>
> However I can appreciate that hand editing the html files is too time
> consuming. I've had plenty of experience moulding html into something the
> Librarian Interface can understand, and suggest that automating the html
> manipulation (using a little C/Java/Perl program, or perhaps XSLT
> transforms if you're into XML) shouldn't be too hard, depending on how
> much metadata you want to extract and how complex it is. If you're
> stuck you could send me more details about what metadata you want to
> extract (perhaps some example files), and I could probably code up a
> little Java program to do it.
>
> Good Luck,
> John Thompson
>
> =================Fixed HTML Sample===================
>
> <HTML>
> <HEAD>
> <META name="Subject" content="African-American" />
> </HEAD>
> <BODY>
>
> <TITLE>WWII In-migration & Rising Bigotry</TITLE><b><H1>WWII In-migration
> & Rising Bigotry</H1> </b>
>
> <!-- <Description>
> <Metadata name="Subject"> African-American </Metadata>
> </Description> -->
>
> ...
>
> =================Collect.cfg for metadata extraction===================
>
> creator john@home
> maintainer john@home
> public true
>
> indexes document:text document:Title document:Source
> defaultindex document:text
>
> plugin ZIPPlug
> plugin GAPlug
> plugin TEXTPlug
> plugin HTMLPlug -metadata_fields Subject
> plugin EMAILPlug
> plugin PDFPlug
> plugin RTFPlug
> plugin WordPlug
> plugin PSPlug
> plugin ArcPlug
> plugin RecPlug
>
> classify AZList -metadata Subject
> classify AZList -metadata Title
> classify AZList -metadata Source
>
> collectionmeta collectionname "Extraction Test"
> collectionmeta iconcollection ""
> collectionmeta collectionextra "Testing the extraction of metadata from html files"
> collectionmeta .document:text "text"
> collectionmeta .document:Title "titles"
> collectionmeta .document:Source "filenames"
>
> > I apologize for taking up the list's time, but I have spent many hours
> > going nowhere with this tool.
> >
> > I have several thousand HTML documents that I wish to import; they have
> > various metadata tags associated with them to show some associated
> > information (theme, some information relating to dates the source
> > material is about, etc.).
> >
> > I have tried adding various combinations of lines in the collection file
> > but to no avail. I have read the various papers available on the
> > Greenstone site and they do nothing to help. I have bought and read a
> > book about digital libraries and it also leaves me stumped.
> >
> > Could someone PLEASE provide the minimum bits I need? I need to know
> > exactly what to put in a configuration file to have it see and list
> > these new metadata tags. Everything I do it ignores.
> >
> > If I have to handcode every document and create entries for every single
> > metadata item I think this tool will be left in the wastebin of history.
> > I am beyond frustrated with this tool -- I spent an equivalent amount of
> > time -- 50 hours or so -- to create an interactive map with various
> > layers and interactivity using an opensource tool. With Greenstone my
> > first 50 hours of work equals a big pile of nothing.
> >
> > With fading hopes,
> >
> > Greg Williamson
> > gsw@globexplorer.com
> >
> > =================Sample HTML snippet from a source file showing a bit of
> > one "Subject" I want to code:
> >
> > <BODY>
> >
> > <TITLE>WWII In-migration & Rising Bigotry</TITLE><b><H1>WWII
> > In-migration & Rising Bigotry</H1> </b>
> >
> > <!--
> > <Description>
> >
> > <Metadata name="Subject"> African-American </Metadata>
> >
> > </Description>
> > -->
> > ...
> >
> > Sample snippet from a configuration file:
> >
> > ...
> > indexes document:text document:Title document:Source document:Subject
> > defaultindex document:text
> >
> > plugin ZIPPlug
> > plugin GAPlug
> > plugin TEXTPlug
> > plugin HTMLPlug -description_tags
> > plugin EMAILPlug
> > plugin PDFPlug
> > plugin RTFPlug
> > plugin WordPlug
> > plugin PSPlug
> > plugin ArcPlug
> > plugin RecPlug
> >
> >
> > classify AZList -metadata Title
> > classify AZList -metadata Source
> > classify AZList -metadata Subject -buttonname Subject
> > ...
> >
> >
> > _______________________________________________
> > greenstone-devel mailing list
> > greenstone-devel@list.scms.waikato.ac.nz
> > https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-devel