[greenstone-devel] html file and metadata tags

From Gregory S. Williamson
Date Wed, 15 Dec 2004 20:12:15 -0800
Subject [greenstone-devel] html file and metadata tags
I seem to be drifting backwards. Help!

I am importing HTML files with the command-line sequence outlined below.

Alas, my metadata is always garbled. I am sure I will have one of those transcendental "Doh!" moments when someone points out what I am doing wrong -- so any advice would be welcome.

Thanks,

Greg W.

Sample HTML file header:
<HTML>
<HEAD>
<!--
<Description>
<Metadata name="Title">Bloody Thursday</Metadata>
<Metadata name="Subject">1934 General Strike</Metadata>
<Metadata name="Description">Video clip of police and rioters on Bloody Thursday. Link to more on the General Strike.</Metadata>
<Metadata name="BannerImage">GS19341</Metadata>
<Metadata name="Author">San Francisco Public Library,San Francisco,CA - Publisher or Photographer</Metadata>
</Description>
-->
</HEAD>

<b><H1>Bloody Thursday</H1> </b>

<BODY>
...
</BODY>
</html>
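
As a cross-check, the name/value pairs that a header like the one above is meant to carry can be listed with a short script. This is only a rough sketch in Python and says nothing about how Greenstone parses the file; the helper name `expected_metadata` and the regular expressions are assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of Greenstone: it pulls <Metadata name="...">
# pairs out of <Description> blocks that sit inside HTML comments.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
DESCRIPTION_RE = re.compile(
    r"<Description>(.*?)</Description>", re.DOTALL | re.IGNORECASE
)
METADATA_RE = re.compile(
    r'<Metadata\s+name="([^"]+)"\s*>(.*?)</Metadata>', re.DOTALL | re.IGNORECASE
)


def expected_metadata(html):
    """Return (name, value) pairs found in commented <Description> blocks."""
    pairs = []
    for comment in COMMENT_RE.findall(html):
        for description in DESCRIPTION_RE.findall(comment):
            for name, value in METADATA_RE.findall(description):
                pairs.append((name, value.strip()))
    return pairs


if __name__ == "__main__":
    sample = """<HTML>
<HEAD>
<!--
<Description>
<Metadata name="Title">Bloody Thursday</Metadata>
<Metadata name="Subject">1934 General Strike</Metadata>
<Metadata name="BannerImage">GS19341</Metadata>
</Description>
-->
</HEAD>
</HTML>"""
    for name, value in expected_metadata(sample):
        print(f"{name}: {value}")
```

Comparing this list against the archive XML that import.pl writes should show quickly which fields came through and which did not.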

====================
The config file is:
creator me@shapingsf.org
maintainer me@shapingsf.org
public true

indexes document:text document:Title document:Source document:Subject document:Author document:Period
defaultindex document:text

plugin ZIPPlug
plugin GAPlug
plugin TEXTPlug
plugin HTMLPlug -metadata_fields Subject,Title,Author,Period,BannerImage
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
plugin WordPlug
plugin PSPlug
plugin ArcPlug
plugin RecPlug

classify AZList -metadata Title
classify AZList -metadata Source
classify AZCompactList -metadata Subject -mingroup 1
classify AZCompactList -metadata Author -mingroup 1 -buttonname "Contributors"
classify AZCompactList -metadata Period

format DocumentImages false
format DocumentContents false
format DocumentHeading '<img src="/gsdl/images/[BannerImage].jpg">'
format DocumentText '[Text]'

collectionmeta collectionname "1934 general Strike with metadata (2)"
collectionmeta iconcollection "/gsdl/images/top_banner-2.gif"
collectionmeta collectionextra ""
collectionmeta .document:text "text"
collectionmeta .document:Title "titles"
collectionmeta .document:Source "filenames"
collectionmeta .document:Subject "subjects"
collectionmeta .document:Period "periods"

==============
Sample XML file from this processing:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Archive>
<Section>
<Description>
<Metadata name="lastmodified">1103167185</Metadata>
<Metadata name="gsdlsourcefilename">import/34strike$vigilante-raids-1934.html</Metadata>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">iso_8859_1</Metadata>
<Metadata name="Plugin">HTMLPlug</Metadata>
<Metadata name="FileSize">11015</Metadata>
<Metadata name="Source">34strike$vigilante-raids-1934.html</Metadata>
<Metadata name="Title">Title</Metadata>
<Metadata name="Subject">Subject</Metadata>
<Metadata name="BannerImage">BannerImage</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="FileFormat">HTML</Metadata>
<Metadata name="URL">http://34strike$vigilante-raids-1934.html</Metadata>
<Metadata name="Identifier">HASH018fae6b6dfbe4fc85c7294c</Metadata>
...

(Note the repeated Author entries. This isn't the document shown above, but all the files are formatted the same way and show the same issue: each value is just the field name, so "Title" is "Title".)
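
To see how widespread the symptom is, the archive files can be scanned for Metadata elements whose value is identical to their name. The following is a minimal sketch in Python, assuming the elements appear on the page exactly as in the sample above; `garbled_fields` is a hypothetical name, and a regular expression is used instead of an XML parser so that truncated excerpts still work.

```python
import re
from collections import Counter

# Matches elements of the form <Metadata name="X">value</Metadata>.
METADATA_RE = re.compile(r'<Metadata name="([^"]+)">(.*?)</Metadata>', re.DOTALL)


def garbled_fields(archive_xml):
    """Count Metadata elements whose value is the same as their name."""
    counts = Counter()
    for name, value in METADATA_RE.findall(archive_xml):
        if value.strip() == name:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = """<Metadata name="Source">34strike$vigilante-raids-1934.html</Metadata>
<Metadata name="Title">Title</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="FileFormat">HTML</Metadata>"""
    for name, count in sorted(garbled_fields(sample).items()):
        print(f"{name}: {count} garbled value(s)")
```

Fields such as Source or FileFormat, whose values differ from their names, are not counted, so only the suspicious entries are reported.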

===============
Commands:
51 19:19 mkcol.pl -creator me@shapingsf.org ssf_346
52 19:19 pwd
53 19:19 cd ../ssf_346
54 19:19 cd ../../ssf_346
(I edit the config file here)
55 19:19 cd import
56 19:19 cp /usr/ssf_a/34* .
57 19:21 import.pl ssf_346
58 19:22 buildcol.pl ssf_346
59 19:44 history

Greg W.