Re: [greenstone-users] AZLists in very large collections

From: Stefan Boddie
Date: Wed, 3 Dec 2003 09:45:17 +1300
Subject: Re: [greenstone-users] AZLists in very large collections
In-Reply-To: (sfcc78ec-006-apps-niwi-knaw-nl)
Hi Rene,

>
> I'm afraid it's not going to work at all. I tested 12.5% of the
> collection and the build process completed in less than half an hour, so
> the entire collection should complete in 4 hours, right? Wrong! It was
> still running after 24 hours and displayed an "Out of memory!" message
> (but still running). I had a fixed paging file (Windows XP) of 1 Gb so I
> changed the max size to 4096 and rebuilt the collection. Next day: out
> of memory, still running and a Windows message about the min paging file
> size being too small. It was increased by Windows but during that
> process some memory requests were denied (according to the message). The
> size of the paging file was 1.2 Gb at the time. The out of memory error
> occurred during the list processing, just after the final index phase but
> before the auxiliary files processing (which was never reached). It
> would seem that the thesaurus (hierarchy classifier) was the main cause
> of the problem because processing time increased exponentially after it
> was added, so I don't think adding another hierarchy classifier will do
> any good.
>

The hierarchy classifier (and all other classifiers for that matter) build
up data structures in memory, adding information for each document as it is
processed. If you had a very large number of documents these data structures
could conceivably get quite large and use lots of memory. Having said that,
unless you've got 100 million or so documents I wouldn't expect it to reach
the sorts of limits you mention.
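To make the growth pattern concrete, here is a toy sketch (in Python for brevity; Greenstone's real classifiers are Perl and considerably more elaborate) of a classifier that records one entry per document in memory. The class and field names are invented for illustration; the point is simply that the in-memory structure grows linearly with the number of documents and is only released after the final sort:

```python
# Toy illustration of per-document accumulation in a classifier.
# Not Greenstone code: names and structure are hypothetical.

class ToyAZListClassifier:
    def __init__(self, metadata_field):
        self.metadata_field = metadata_field
        self.buckets = {}  # first letter -> list of (sort_key, doc_id)

    def classify(self, doc_id, metadata):
        """Called once per document during the build; each call adds
        a record that stays in memory until finalize()."""
        value = metadata.get(self.metadata_field, "")
        if not value:
            return
        letter = value[0].upper()
        self.buckets.setdefault(letter, []).append((value, doc_id))

    def finalize(self):
        # Every record for every document is still resident here,
        # which is where a 52000-document collection starts to hurt.
        return {k: sorted(v) for k, v in self.buckets.items()}

clf = ToyAZListClassifier("Title")
clf.classify("D1", {"Title": "Zebra studies"})
clf.classify("D2", {"Title": "Aardvark biology"})
clf.classify("D3", {"Title": "Applied botany"})
result = clf.finalize()
```

With several classifiers each holding a structure like `buckets` for the whole collection at once, peak memory is roughly the sum of all of them.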

This would be a good thing to fix properly I think, though I don't have time
to do it myself right now. If nobody else volunteers I'll add it to my list
and get to it as soon as I can, though it may not make the next release.

I think we should do the following two things:

a) Take a look at the hierarchy classifier and work out why it's using a lot
more memory than the other classifiers. It uses some fairly complex and ugly
data structures that can perhaps be streamlined.

b) Create an option, at least for the hierarchy classifier but ideally for
all classifiers, that allows all the temporary classification data to be
stored on disk instead of in memory.
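One possible shape for (b), again sketched in Python rather than Greenstone's Perl, is to spill the classification records to a disk-backed key/value store as documents are processed, so resident memory stays roughly constant regardless of collection size. The `shelve` store and bucket layout here are assumptions for illustration, not a description of how Greenstone would actually do it:

```python
import os
import shelve
import tempfile

# Hypothetical sketch of disk-backed classification data: each bucket
# lives in a shelf on disk and is read back only when needed, instead
# of the whole structure sitting in memory for the entire build.

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "classify_db")

with shelve.open(path) as store:
    docs = [("D1", "Zebra studies"), ("D2", "Aardvark biology")]
    for doc_id, title in docs:
        letter = title[0].upper()
        bucket = store.get(letter, [])   # read one bucket from disk
        bucket.append((title, doc_id))
        store[letter] = bucket           # write it straight back

    # At output time, buckets are pulled back and sorted one at a
    # time, so only a single bucket is ever fully in memory.
    titles_a = sorted(store.get("A", []))
```

The trade-off is the one you would expect: the build does more I/O and runs slower on small collections, but memory no longer scales with the number of documents, which is exactly the failure mode Rene is hitting.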

> As for the solution, I am considering the following options:
> 1. Increase physical memory from 256 Mb to 512 Mb or even 1 Gb and use
> subcollections (will this also split up the AZLists??)

Increasing physical memory to as much as you can get is certainly a good
start. The collection is taking so long to build because it's swapping
heavily. More memory will help, though it sounds like even 1 GB won't solve
all your problems.

Using subcollections won't help as the classifiers won't be split up.

> 2. Create separate collections (e.g. Dutch, English, French, German)
> and use cross-collection searching (problem: export to CD)

That's an option, yes, though I'm not a huge fan of the cross-collection
searching function.

> 3. Drop some of the AZLists
>

The AZLists will be adding to the memory problems, though I'm not sure by
how much. Try dropping all the classifiers apart from your one hierarchy
classifier and see if things are any better.

Stefan.


> Any ideas, comments, advice?
>
> Rene
>
>
> >>> "Stefan Boddie" <sjboddie@cs.waikato.ac.nz> 01-12-2003 21:28:37
> >>>
> Hi Rene,
>
> You're right that AZLists and most other classifiers don't scale well
> to
> large collections. You might need to resort to setting up a Hierarchy
> classifier to nest your documents in a more complex structure.
>
> Stefan.
>
> ----- Original Message -----
> From: "Rene Schrama" <Rene.Schrama@niwi.knaw.nl>
> To: <greenstone-users@list.scms.waikato.ac.nz>
> Sent: Monday, December 01, 2003 10:49 PM
> Subject: [greenstone-users] AZLists in very large collections
>
>
> > Hi,
> >
> > I just built a collection of about 6500 documents, which is about 12%
> > of the entire collection, which consists of about 52000 documents. The
> > problem is that the pages of the AZLists are already a bit chunky, but
> > after the entire collection is built they will be huge, e.g. the title
> > list will have about 2000 titles on one page. Did anyone try this
> > before, and are there any known solutions to this problem?
> >
> > Rene
> >
> >
> > _______________________________________________
> > greenstone-users mailing list
> > greenstone-users@list.scms.waikato.ac.nz
> > https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users
> >