Re: [greenstone-users] AZLists in very large collections

From: Rene Schrama
Date: Fri, 05 Dec 2003 10:41:54 +0100
Subject: Re: [greenstone-users] AZLists in very large collections
Hi Stefan,

I rebuilt the entire collection (52,000 documents) without the
hierarchy classifier (thesaurus) and it completed in 5 hours without any
problems. Classifiers used: 1 AZList, 1 AZCompactList, 1 DateList. The
size of the exported collection was only 133 MB. The performance of the
browsing lists (after exporting and installing the collection) is beyond
expectation. Even the bulkiest page loads within seconds (and still does
after deleting the IE cache).

Rene


>>> "Stefan Boddie" <sjboddie@cs.waikato.ac.nz> 02-12-2003 21:45:17
>>>
Hi Rene,

The hierarchy classifier (and all other classifiers, for that matter)
builds up data structures in memory, adding information for each
document as it is processed. If you have a very large number of
documents these data structures could conceivably get quite large and
use lots of memory. Having said that, unless you've got 100 million or
so documents I wouldn't expect it to reach the sorts of limits you
mention.
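To make the memory behaviour Stefan describes concrete, here is a
minimal sketch of a classifier that accumulates one in-memory entry
per document. This is a hypothetical Python illustration only --
Greenstone's real classifiers are written in Perl, and the class and
method names here are invented:

```python
# Hypothetical sketch of per-document accumulation in a classifier.
# Greenstone's actual classifiers are Perl; these names are invented
# purely to illustrate why memory grows with collection size.

class HierarchyClassifier:
    def __init__(self):
        # One in-memory entry per hierarchy node, each holding a list
        # of document IDs. This mapping grows with the collection and
        # is what can exhaust RAM on very large builds.
        self.entries = {}

    def classify(self, doc_id, metadata):
        # Called once per document as the build processes it.
        for node in metadata.get("Hierarchy", []):
            self.entries.setdefault(node, []).append(doc_id)

    def output(self):
        # The structure is only written out at the end of the build,
        # so peak memory is proportional to the whole collection.
        return {node: sorted(docs) for node, docs in self.entries.items()}

clf = HierarchyClassifier()
clf.classify("HASH01", {"Hierarchy": ["2.1", "2.1.3"]})
clf.classify("HASH02", {"Hierarchy": ["2.1"]})
print(clf.output())
```

Nothing is flushed until `output()` runs at the end of the build,
which is why a 52,000-document collection with several classifiers can
push a 256 MB machine into heavy swapping.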

This would be a good thing to fix properly, I think, though I don't
have time to do it myself right now. If nobody else volunteers I'll
add it to my list and get to it as soon as I can, though it may not
make the next release.

I think we should do the following two things:

a) Take a look at the hierarchy classifier and work out why it's using
a lot more memory than the other classifiers. It uses some fairly
complex and ugly data structures that can perhaps be streamlined.

b) Create an option, at least for the hierarchy classifier but ideally
for all classifiers, that allows all the temporary classification data
to be stored on disk instead of in memory.
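Option (b) could be sketched as keeping the per-node document lists in
a disk-backed store rather than a plain in-memory mapping. The
following is a hypothetical Python illustration using the standard
library's shelve module -- again, not Greenstone's actual (Perl)
implementation, and all names are invented:

```python
# Hypothetical illustration of storing temporary classification data
# on disk. Python's shelve module backs the mapping with a file, so
# resident memory stays roughly flat as documents are classified.
# Greenstone itself is Perl; names here are invented for illustration.
import os
import shelve
import tempfile

class DiskBackedClassifier:
    def __init__(self, path):
        # The node -> document-ID mapping lives on disk, not in RAM.
        self.store = shelve.open(path)

    def classify(self, doc_id, nodes):
        for node in nodes:
            docs = self.store.get(node, [])
            docs.append(doc_id)
            self.store[node] = docs  # write the updated list back to disk

    def close(self):
        self.store.close()

path = os.path.join(tempfile.mkdtemp(), "classifier_tmp")
clf = DiskBackedClassifier(path)
clf.classify("HASH01", ["2.1", "2.1.3"])
clf.classify("HASH02", ["2.1"])
result = dict(clf.store)
print(result)
clf.close()
```

The trade-off is extra disk I/O per document in exchange for bounded
memory, which is the right deal once a build starts swapping anyway.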

> As for the solution, I am considering the following options:
> 1. Increase physical memory from 256 Mb to 512 Mb or even 1 Gb and
> use subcollections (will this also split up the AZLists??)

Increasing physical memory to as much as you can get is certainly a
good start. The reason the collection is taking so long to build is
that it'll be swapping a lot. More memory will help, though it sounds
like even 1 Gb won't solve all your problems.

Using subcollections won't help as the classifiers won't be split up.

> 2. Create separate collections (e.g. Dutch, English, French, German)
> and use cross-collection searching (problem: export to CD)

That's an option, yes, though I'm not a huge fan of the
cross-collection searching function.

> 3. Drop some of the AZLists
>

The AZLists will be adding to the memory problems, though I'm not sure
by how much. Try dropping all the classifiers apart from your one
hierarchy classifier and see if things are any better.

Stefan.