Re: [greenstone-users] AZLists in very large collections

From: Rene Schrama
Date: Fri, 05 Dec 2003 10:41:54 +0100
Subject: Re: [greenstone-users] AZLists in very large collections
Hi Stefan,

I rebuilt the entire collection (52,000 documents) without the
hierarchy classifier (thesaurus) and it completed in 5 hours without any
problems. Classifiers used: 1 AZList, 1 AZCompactList, 1 DateList. The
size of the exported collection was only 133 MB. The performance of the
browsing lists (after exporting and installing the collection) is beyond
expectation. Even the bulkiest page loads within seconds (and still does
after deleting the IE cache).


>>> "Stefan Boddie" <> 02-12-2003 21:45:17
Hi Rene,

The hierarchy classifier (and all other classifiers for that matter) builds
up data structures in memory, adding information for each document as it is
processed. If you have a very large number of documents these data structures
could conceivably get quite large and use lots of memory. Having said that,
unless you've got 100 million or so documents I wouldn't expect it to hit
the sorts of limits you mention.

This would be a good thing to fix properly I think, though I don't have time
to do it myself right now. If nobody else volunteers I'll add it to my list
and get to it as soon as I can, though it may not make the next release.
I think we should do the following two things:

a) Take a look at the hierarchy classifier and work out why it's using a lot
more memory than the other classifiers. It uses some fairly complex
data structures that can perhaps be streamlined.

b) Create an option, at least for the hierarchy classifier but ideally for
all classifiers, that allows all the temporary classification data to be
stored on disk instead of in memory.
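Greenstone's build code is Perl, but as a language-neutral illustration of option (b), here is a minimal sketch in Python of the idea: keep the per-letter classification buckets in a disk-backed map (Python's shelve module) rather than an in-memory hash, so only one bucket at a time needs to be resident. The document IDs and titles are made up for the example.

```python
import os
import shelve
import tempfile

# Hypothetical sketch: accumulate AZList-style classification entries
# in a disk-backed map instead of an in-memory dict, so memory use
# stays roughly flat regardless of collection size.
path = os.path.join(tempfile.mkdtemp(), "classify_tmp")
store = shelve.open(path)

documents = [("HASH01", "Apple"), ("HASH02", "Banana"), ("HASH03", "apricot")]

for doc_id, title in documents:
    letter = title[0].upper()       # bucket key: first letter of the title
    bucket = store.get(letter, [])  # read this bucket from disk
    bucket.append(doc_id)
    store[letter] = bucket          # write it back; only one bucket in RAM

# Emit the final classification structure one bucket at a time
for letter in sorted(store):
    print(letter, store[letter])

store.close()
```

The trade-off is exactly the one discussed above: each update becomes a disk read and write instead of a memory operation, so building is slower per document but no longer limited by physical RAM or degraded by swapping.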

> As for the solution, I am considering the following options:
> 1. Increase physical memory from 256 Mb to 512 Mb or even 1 Gb and use
> subcollections (will this also split up the AZLists??)

Increasing physical memory to as much as you can get is certainly a good
start. The reason the collection is taking so long to build is because it must
be swapping a lot. More memory will help, though it sounds like even that
won't solve all your problems.

Using subcollections won't help as the classifiers won't be split up.

> 2. Create separate collections (e.g. Dutch, English, French, German)
> and use cross-collection searching (problem: export to CD)

That's an option, yes, though I'm not a huge fan of the cross-collection
searching function.

> 3. Drop some of the AZLists

The AZLists will be adding to the memory problems, though I'm not sure by
how much. Try dropping all the classifiers apart from your one hierarchy
classifier and see if things are any better.