[greenstone-devel] html file and metadata tags

From Gregory S. Williamson
Date Wed, 15 Dec 2004 20:12:15 -0800
Subject [greenstone-devel] html file and metadata tags
I seem to be drifting backwards. Help!

I am using HTML files imported using the command line sequence outlined below.

Alas, my metadata is always being garbled. I am sure I will have one of those transcendental "Doh!" moments when someone points out what I am doing wrong -- so any advice would be welcome.


Greg W.

Sample HTML file header:
<Metadata name="Title">Bloody Thursday</Metadata>
<Metadata name="Subject">1934 General Strike</Metadata>
<Metadata name="Description">Video clip of police and rioters on Bloody Thursday. Link to more on the General Strike.</Metadata>
<Metadata name="BannerImage">GS19341</Metadata>
<Metadata name="Author">San Francisco Public Library,San Francisco,CA - Publisher or Photographer</Metadata>

<b><H1>Bloody Thursday</H1> </b>


The config file is:
creator me@shapingsf.org
maintainer me@shapingsf.org
public true

indexes document:text document:Title document:Source document:Subject document:Author document:Period
defaultindex document:text

plugin ZIPPlug
plugin GAPlug
plugin TEXTPlug
plugin HTMLPlug -metadata_fields Subject,Title,Author,Period,BannerImage
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
plugin WordPlug
plugin PSPlug
plugin ArcPlug
plugin RecPlug

classify AZList -metadata Title
classify AZList -metadata Source
classify AZCompactList -metadata Subject -mingroup 1
classify AZCompactList -metadata Author -mingroup 1 -buttonname "Contributors"
classify AZCompactList -metadata Period

format DocumentImages false
format DocumentContents false
format DocumentHeading '_If_("[BannerImage]" ne "",<img src="_httpprefix_/images/[BannerImage].jpg">,<P><CENTER><H2>[Subject]</H2></CENTER><P>,<P>)'
format DocumentText '[Text]'

collectionmeta collectionname "1934 general Strike with metadata (2)"
collectionmeta iconcollection "_httpprefix_/images/top_banner-2.gif"
collectionmeta collectionextra ""
collectionmeta .document:text "text"
collectionmeta .document:Title "titles"
collectionmeta .document:Source "filenames"
collectionmeta .document:Subject "subjects"
collectionmeta .document:Period "periods"

Sample XML file from this processing:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE Archive SYSTEM "http://greenstone.org/dtd/Archive/1.0/Archive.dtd">
<Metadata name="lastmodified">1103167185</Metadata>
<Metadata name="gsdlsourcefilename">import/34strike$vigilante-raids-1934.html</Metadata>
<Metadata name="gsdldoctype">indexed_doc</Metadata>
<Metadata name="Language">en</Metadata>
<Metadata name="Encoding">iso_8859_1</Metadata>
<Metadata name="Plugin">HTMLPlug</Metadata>
<Metadata name="FileSize">11015</Metadata>
<Metadata name="Source">34strike$vigilante-raids-1934.html</Metadata>
<Metadata name="Title">Title</Metadata>
<Metadata name="Subject">Subject</Metadata>
<Metadata name="BannerImage">BannerImage</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="FileFormat">HTML</Metadata>
<Metadata name="URL">http://34strike$vigilante-raids-1934.html</Metadata>
<Metadata name="Identifier">HASH018fae6b6dfbe4fc85c7294c</Metadata>

(Note the repeated Author text. This isn't the document I show above, but all are formatted the same way and show the same issue -- "Title" is "Title", and so on.)
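
To see how widespread the symptom is across an archive, a quick scan can flag every metadata element whose value merely repeats its own name. The following is a minimal sketch, not part of Greenstone: the function name `find_suspect_metadata` and the inline sample are hypothetical, and it assumes each `<Metadata>` element sits on a single line, as in the excerpt above.

```python
import re
from collections import Counter

# Matches one <Metadata name="...">value</Metadata> element on a single line.
METADATA_RE = re.compile(r'<Metadata name="([^"]+)">(.*?)</Metadata>')


def find_suspect_metadata(archive_text):
    """Count metadata elements whose value is just a copy of the field name."""
    suspects = Counter()
    for name, value in METADATA_RE.findall(archive_text):
        if value.strip() == name:
            suspects[name] += 1
    return suspects


# Hypothetical sample, trimmed from the archive excerpt in this message.
sample = """\
<Metadata name="Source">34strike$vigilante-raids-1934.html</Metadata>
<Metadata name="Title">Title</Metadata>
<Metadata name="Subject">Subject</Metadata>
<Metadata name="BannerImage">BannerImage</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="Author">Author</Metadata>
<Metadata name="FileFormat">HTML</Metadata>
"""

for name, count in sorted(find_suspect_metadata(sample).items()):
    print(f"{name}: {count} suspect value(s)")
```

Running the same function over the text of each generated archive file would show whether every field named in -metadata_fields is affected or only some of them.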

51 19:19 mkcol.pl -creator me@shapingsf.org ssf_346
52 19:19 pwd
53 19:19 cd ../ssf_346
54 19:19 cd ../../ssf_346
(I edit the config file here)
55 19:19 cd import
56 19:19 cp /usr/ssf_a/34* .
57 19:21 import.pl ssf_346
58 19:22 buildcol.pl ssf_346
59 19:44 history

Greg W.