Re: Changing metadata after collection?

From sjboddie
Date Tue, 26 Feb 2002 09:39:59 +1300
Subject Re: Changing metadata after collection?
In-Reply-To (4-3-2-7-2-20020222082641-00aea800-mail-wrlc-org)
Hi Don,

Sorry I haven't replied to this sooner.

> I have a collection of periodical issues that use the BookPlug to
> break each issue up into articles. To create browse structures on
> metadata that describe individual articles I use a modified version
> of the AZCompactList classifier (modified, for example, to use the
> classify_section method instead of the classify method). Anyway,
> it takes a very long time (about 12 hours on a 2-processor Linux
> server) to build this 700 document collection and the bulk of that
> time is in building these classifiers.
> Likely the poor performance is due to some change I made interacting
> with the recursive nature of the algorithm (I get lots of "classify
> called multiple times" warnings). Hopefully an AZCompactList
> that works with document sections will be in the next version of
> Greenstone. But in the meantime I need to rebuild this collection
> often because it is under development.
> Do you think I could set up a process that nightly rebuilt the
> search indexes and simple (issue-level) browse structures, and
> continued to use the old AZCompactList classifiers until maybe each
> weekend when I could rebuild the whole thing?

It should certainly be possible to do this. You'd use a process
something like the following I imagine.

1. Use "db2txt colname.ldb > colname.ldb.txt" to create a text version
of your existing gdbm database (found in the index/text directory of
your built collection).

2. Write a short perl script to strip out all the entries from
colname.ldb.txt that correspond to your AZCompactList classifier (e.g.
[CL3], [CL3.1] ...). Write these entries to another text file.

3. Rebuild the collection without including the AZCompactList
classifier.

4. Merge the text file created in step 2 into the gdbm database created
by the new build. To do this you'd need to:
a) Make sure CL3 (for example) didn't already exist in the database. If
it did you'd need to rename the entries for the AZCompactList classifier
to CL4, CL9 or whatever.
b) Alter the [browse] entry in the database. This entry has a <contains>
field that specifies which classifiers are included.
c) Append the info in the text file to the database using something like
"txt2db -append colname.ldb < newinfo.txt".

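As a rough illustration of steps 2, 4a and 4b, here is a minimal sketch
in Python (standing in for the short perl script mentioned above). It
assumes the db2txt dump consists of records that start with a "[key]"
line, continue with "<field>value" lines, and end with a line of dashes,
and that the <contains> field of [browse] is a semicolon-separated list
of classifier ids. Both are assumptions to check against your own
colname.ldb.txt. All function names are made up for this example.

```python
import re

# Assumed record separator written between entries in the text dump.
SEPARATOR = "-" * 70


def parse_records(text):
    """Split a db2txt-style dump into (key, body_lines) pairs."""
    records = []
    key, body = None, []
    for line in text.splitlines():
        if line.startswith("---") and set(line) == {"-"}:
            # A line made only of dashes closes the current record.
            if key is not None:
                records.append((key, body))
            key, body = None, []
        elif key is None:
            match = re.fullmatch(r"\[(.+)\]", line)
            if match:
                key = match.group(1)
        else:
            body.append(line)
    if key is not None:
        records.append((key, body))
    return records


def split_classifier(records, classifier):
    """Step 2: separate entries such as CL3, CL3.1, ... from the rest."""
    kept, extracted = [], []
    for key, body in records:
        mine = key == classifier or key.startswith(classifier + ".")
        (extracted if mine else kept).append((key, body))
    return kept, extracted


def rename(records, old, new):
    """Step 4a: move extracted entries to a free id (e.g. CL3 -> CL4)."""
    return [(new + key[len(old):], body) for key, body in records]


def add_to_contains(body, classifier):
    """Step 4b: add a classifier id to a [browse] entry's <contains>."""
    out = []
    for line in body:
        if line.startswith("<contains>"):
            ids = [i for i in line[len("<contains>"):].split(";") if i]
            if classifier not in ids:
                ids.append(classifier)
            line = "<contains>" + ";".join(ids)
        out.append(line)
    return out


def format_records(records):
    """Write records back out in the same assumed text format."""
    lines = []
    for key, body in records:
        lines.append("[" + key + "]")
        lines.extend(body)
        lines.append(SEPARATOR)
    return "\n".join(lines) + "\n"
```

The prefix test compares against "CL3." rather than "CL3" so that an
unrelated classifier such as CL30 is left alone. Renaming only touches
the keys, which is enough only if the entries refer to their children
relatively; if your dump uses absolute ids inside <contains>, those
would need rewriting too.
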
You'd probably only need to create the AZCompactList text file once;
then (as long as no other classifiers were added or removed from the
collection) you'd just do the "txt2db -append ..." step after each rebuild.
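
If the nightly rebuild is driven from a script, the append step could be
wrapped along these lines. This is only a sketch: it assumes txt2db is
on the PATH and reads the entries from standard input exactly as in the
command quoted in step 4c, and the function name and its parameters are
invented for the example.

```python
import subprocess


def append_entries(db_path, entries_path, txt2db="txt2db", run=subprocess.run):
    """Feed a saved text file to `txt2db -append db_path` on stdin."""
    with open(entries_path, "rb") as entries:
        # check=True makes a failed append raise instead of passing silently.
        return run([txt2db, "-append", db_path], stdin=entries, check=True)
```

It would be called after each rebuild as, for example,
append_entries("colname.ldb", "newinfo.txt"). The run parameter is there
only so the command can be inspected or replaced when trying this out.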

There's no reason I can think of that the above process wouldn't work.
It might be just as easy to fix the AZCompactList classifier to prevent
it from taking so long though.