Re: [greenstone-users] ER1 Another new user query

From Stephen DeGabrielle
Date Fri, 2 Dec 2005 00:51:22 +0930
Subject Re: [greenstone-users] ER1 Another new user query
In-Reply-To (1133446333-1qq92x70y0u8-w12-mail-sapo-pt)
Some answers below

> Quoting E Robinson <>:
> > I would like to develop a digital archive of a set of lodge papers. We have
> > about 500 Word documents - one for each publication. Within each
> > publication there are one or more 'papers', and the discussion (if
> > published) is usually in a later publication. Many of the publications
> > include minutes of meetings and administrative notes which do not need to
> > have Title / Authors etc. I have an Excel spreadsheet listing the Word
> > file, Title, Author, date of publication, number of words for each paper.
> >
> > I am not a librarian or programmer; I have read through the various manuals
> > and tutorials from the website, and am still quite confused, but hopefully
> > that will pass!
Hints to help with the confusion:
1. Keep doing small test collections and experimenting.
2. Parts 1 and 2 of the Developer's Guide are your friends. Refer to
them often.
3. The FAQ pages are a great help!
4. The Greenstone archives are a great resource.
5. has a searchable 'greenstone documentation'
collection - but you can always build your own with the documents you
like to use. (I have the Developer's Guide and the FAQ in one
collection running on my home and work computers for easy access.)
6. the has great example collections.

> > My questions are:
> >
> > 1.1 Is it possible to create details for different papers within the one
> > document with different title, author(s), subject(s)? How would I generate
> > metadata information for that? I think this may be possible by using
> > sections within the Word documents - can each section be treated as a
> > chapter of the whole document? I would prefer not to split the current
> > Word files - some of the papers to be indexed are quite short.

Check the Developer's Guide: page 44, 'Tagging document files', covers this.
While the example only gives Title metadata for each section, I believe
you can also have other metadata, as well as multiple Authors.
(I think you may have to specify mode='accumulate' for multiple values.)
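
As a rough sketch of what such tagging could look like, the Python below
generates one section block in the commented-out Section / Description /
Metadata layout that the guide's example uses. The function name
section_block and the sample values are made up for illustration, and
putting mode="accumulate" on repeated names is only the guess mentioned
above, not verified behaviour.

```python
from collections import Counter
from xml.sax.saxutils import escape, quoteattr


def section_block(metadata, body):
    """Wrap body text in commented-out section tags.

    metadata is a list of (name, value) pairs; a name may repeat,
    e.g. several Author entries for one paper.
    """
    counts = Counter(name for name, _ in metadata)
    lines = ["<!--", "<Section>", "<Description>"]
    for name, value in metadata:
        # Assumption: names with several values need mode="accumulate"
        # so that later values do not replace earlier ones.
        mode = ' mode="accumulate"' if counts[name] > 1 else ""
        lines.append(
            f"<Metadata name={quoteattr(name)}{mode}>{escape(value)}</Metadata>"
        )
    lines += ["</Description>", "-->", body, "<!--", "</Section>", "-->"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical paper inside a larger publication.
    print(section_block(
        [("Title", "An example paper"),
         ("Author", "A. Smith"),
         ("Author", "B. Jones")],
        "Text of the paper goes here.",
    ))
```

Each paper in a publication would get its own block, so one file can
carry several titles and author lists without being split.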

> > 1.2 Can I avoid searches looking at selected (or not selected!) parts of
> > the Word file ? (eg minutes)?

I'm not sure. I'd try putting the whole section's text in a metadata
element in the section:

<Metadata name="Title">Minutes of meeting</Metadata>
<Metadata name="date">1970-01-04</Metadata>
<Metadata name="Minutes">minutes text here
blah blah
- I believe this text doesn't get indexed with the body of the document
</Metadata>
(noting that it won't appear, as it is commented out)

> > 1.3 I have an Excel spreadsheet listing the Word file, Title, Author, date
> > of publication, subject, number of words and summary (up to about 200
> > words) for each paper - there are just under 900 sets of data. Can I use
> > that to create metadata? (I don't relish the thought of copying data for
> > 500 files / 900 papers by copying individual cells). Should I be looking to
> > include as much data as possible in the Word file? (Is metadata generated
> > from the data in File/Properties in Word?)

Yes - I believe DBPlug can do what you want.
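
Separately from DBPlug, one route that may be worth testing is to export
the spreadsheet as CSV and turn it into a Greenstone metadata.xml file.
The sketch below is only that: the column names (File, Title, Author,
Date) are hypothetical, FileName is assumed to be read as a regular
expression (hence the escaped dot), and the metadata is attached to whole
files rather than to sections, so on its own it would not separate
several papers held in one document.

```python
import csv
import io
import xml.etree.ElementTree as ET


def build_metadata_xml(csv_text, file_column="File"):
    """Turn CSV rows into a DirectoryMetadata XML string."""
    root = ET.Element("DirectoryMetadata")
    for row in csv.DictReader(io.StringIO(csv_text)):
        fileset = ET.SubElement(root, "FileSet")
        # Assumption: FileName is matched as a regex, so escape the dots.
        name = row[file_column].replace(".", r"\.")
        ET.SubElement(fileset, "FileName").text = name
        description = ET.SubElement(fileset, "Description")
        for column, value in row.items():
            if column == file_column or not value:
                continue  # skip the file column and empty cells
            # accumulate keeps values from several rows for the same file
            meta = ET.SubElement(
                description, "Metadata", name=column, mode="accumulate"
            )
            meta.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical two-row export; real data would come from the CSV file.
    sample = (
        "File,Title,Author,Date\n"
        "pub001.doc,First paper,A. Smith,1970-01-04\n"
        "pub001.doc,Second paper,B. Jones,1970-01-04\n"
    )
    print(build_metadata_xml(sample))
```

Try it on a small test collection of two or three files before running
it over all 900 rows.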

> > 1.4 I had one Word file that did not convert when I created a test
> > collection - it contains a large number of illustrations (charts copied
> > from Excel). I was able to save to HTML format from within Word. Can I use
> > that HTML file in the collection? If so how?
Yes, you can use it - just import the HTML file with the rest of the
Word documents. It should work fine.

> > 1.5 Are there any tutorials / courses for beginning users being run in
> > Wellington, NZ that I could go to?

There are notes and resources from the one-, three- and four-day courses
that the developers have run on the site.

DL Consulting may be able to assist
with some aspects of your project. They might do training too. You
should probably contact them for a quote.

I hope this helps and good luck with your digital archive.


Stephen De Gabrielle