Re: [greenstone-users] Starting the work Gatherer/gsdl

From John M Thompson
Date Thu, 04 Sep 2003 12:25:06 +1200
Subject Re: [greenstone-users] Starting the work Gatherer/gsdl
In-Reply-To (BAY8-F92HvDHoSPfdci0000839c-hotmail-com)
Hi Emiliano,

Emiliano Marmonti wrote:

> 1. How can I maintain the hierarchical data that gli generates
> from the user input? I have seen that gli automatically generates the
> file that is set in "classify hierarchy hfile ..." (although the name
> may not be respected: does it automatically generate dc.Subject.txt,
> or something like metadata.txt?) when the user enters values of the
> form term\term\term, and that's OK. But gli maintains
> another list from the user input, and this can be confusing.
> For example, my user entered UNLP\Facultad de Ciencias Exactas and
> then corrected it to Universidad Nacional de La Plata\Facultad de
> Ciencias Exactas. I can correct the file automatically generated by
> gli because it is located in the etc folder of the collection. But the
> users continue to see UNLP in the gli interface. Where can I
> delete this entry from?


The value tree at the bottom of the enrich pane shows all of the
previous values assigned for a particular element, not just those that
are currently in use. In order to edit this tree you should use the
'Edit Metadata Sets' option in the 'Metadata' menu, then remove values
from the appropriate element. If you happen to remove a value that is
still in use it will just get added back again next time you save. There
was at some stage a suggestion that we should allow the user to
determine if unused values should be pruned on save - I'll add it to the
request list.

> 2. I'm putting documents with a lot of formulas and math symbols into
> Greenstone. Sometimes (quite often, I think) PDFPlug doesn't convert
> them very well. I would like to edit the generated HTML file to correct
> those symbols, or simply to remove them and leave only the abstract in
> the HTML files. Is there a way to do so? Could I note the identifier of
> the file and look for the document? Where could I find it?

The HTML (or actually GML) versions of the imported files are found in
the archives folder, but as you suggest the tricky bit is tracking down
the correct identifier. The easiest way, if you have built the
collection, is to browse the pages and note down the HASH identifier
of the erroneous document - this is encoded in the page's URL and should
be something like "&d=HASH017f5b8d1c09fdb1e6ba4e0a". Armed with this
information, you then look at the archive.inf file in the archives
folder, looking for a mapping from the full identifier to the one used
in the archives folder (which is some substring of the full identifier -
just enough to make it unique). Go to the listed file
(something/doc.xml), then get ready for a bit of an editing nightmare.
The file contains the full HTML of the source document, but all of
the < and > are encoded, so it can take a bit of searching to find the
actual bit you're looking for.
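
The lookup described above can be sketched in Python. This is only a
rough illustration and not part of Greenstone: the function names are
hypothetical, and it assumes that each line of the .inf file holds the
full identifier followed by whitespace and then the path relative to the
archives folder, so check your own file before relying on it. The
html.unescape and html.escape calls are from the Python standard library.

```python
import html
import re


def extract_doc_id(url):
    """Return the top-level HASH identifier from the 'd' argument of a page URL."""
    match = re.search(r"[?&]d=(HASH[0-9a-fA-F]+)", url)
    return match.group(1) if match else None


def find_archive_path(inf_text, doc_id):
    """Look up doc_id in the text of the .inf file.

    ASSUMPTION: each line is '<identifier> <relative path>' separated by
    whitespace. Adjust the parsing if your file is laid out differently.
    """
    for line in inf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == doc_id:
            return fields[1]
    return None


def decode_markup(text):
    """Turn encoded markup (&lt;p&gt;) back into readable tags (<p>)."""
    return html.unescape(text)


def encode_markup(text):
    """Re-encode &, < and > before pasting edited text back into the file."""
    return html.escape(text, quote=False)
```

Decoding a copy of the relevant section makes the offending passage much
easier to spot; whatever you change has to be re-encoded before it goes
back into doc.xml, otherwise the file will no longer be well-formed.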

Of course, all this assumes you're using a built collection. If you only
want to run import, then you could try:
a) 'grep'ing the contents of the archives folder for the offending
phrase/filename
b) editing the plugin which assigns HASH identifiers so that it also
prints out filename -> HASH information (this could be really hard - I'm
not even sure which plugin is responsible for this)
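
For option a), where grep is not available (for example on Windows), a
rough Python equivalent might look like the sketch below. The function
name is hypothetical and it assumes the archive files can be read as
UTF-8 text. Because the markup inside the archive files is encoded, it
searches for the phrase both as typed and in its encoded form.

```python
import html
import os


def find_files_containing(root, phrase):
    """Return the sorted paths of all files under root that contain phrase."""
    # Search for the literal phrase and for its encoded form, since
    # characters such as & and < are stored encoded in the archive files.
    needles = {phrase, html.escape(phrase, quote=False)}
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    text = handle.read()
            except OSError:
                continue  # skip files that cannot be opened
            if any(needle in text for needle in needles):
                matches.append(path)
    return sorted(matches)
```

Calling find_files_containing with the path of the archives folder and a
distinctive phrase or the source filename returns the candidate files to
inspect.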

Hope that helps,
John Thompson