[greenstone-users] Crash of buildcol.pl for largish collection

From H.M. Gladney
Date Sun, 3 Jun 2007 18:15:51 -0700
Subject [greenstone-users] Crash of buildcol.pl for largish collection

Ref: greenstone-users Digest, Vol 50, Issue 23, 2007-05-18 3:31 AM, message 5 (excerpt attached)
After delays for marginally related reasons, I have again taken up building a GSDL collection based on the Snobol collection held by the Computer History Museum.  Following the advice in the attached note, I did so using the command-line procedures described in the Greenstone Developer's Guide, pp. 6-10.

Problem: the buildcol.pl run fails.  The last few lines of the console file are:
GAPlug: processing HASH098b.dir/doc.xml
doc.ps to TEXT format
TEXTPlug: processing home/gladney/gsdl/collect

PSPlug: extracting PostScript metadata from "/home/gladney/gsdl/collect/snotest3/archives/HASH098b.dir/doc.ps"
RecPlug: getting directory /home/gladney/gsdl/collect/snotest3/archives/HASH1adf
RecPlug: getting directory /home/gladney/gsdl/collect/snotest3/archives/HASH1adf/9193150c.dir
GAPlug: processing HASH1adf/9193150c.dir/doc.xml
numDocs: 12930
numChunkDocs: 2788
numDocsInChunk: 2790
numFrags: 15916524
numFragsInChunk: 3314387
chunkStartFragNum: 12704414
num: 14754
[num].start: 10436949
[num].here: 10436994
[num+1].start: 10436972
mgpp_passes : Bit buffer overrun

Reminder: the collection at hand (alluded to in earlier greenstone-users posts) consists of approx. 75,000 files organized in a directory tree with over 2000 internal nodes.  The space needed for this collection is approx. 3.7 GB.  This size is typical of other collections anticipated in the Computer History Museum (CHM).  That is, unless this sample can be built successfully, the CHM SW Preservation group will have to abandon Greenstone in favor of some other digital library offering.  That would be sad, in view of my perception that the functionality and flexibility of Greenstone address many CHM needs!

Perhaps a collection of this size is beyond what Greenstone has often been applied to.  If so, one might conjecture that it is likely to encounter and flush out GSDL bugs that prior applications have not exposed, particularly resource-constraint bugs such as memory leaks.  Does the Greenstone team have an opinion about this?

Next steps?  At the moment, the only possibility that I see is to send the GSDL team a copy of the Snobol collection, so that the team can try to reproduce the problem and debug it.  Obviously that would have to be by snail-mailing a DVD copy.  Is the team interested in our doing this?  If so, what address should I use?

Cheerio, Henry
H.M. Gladney, Ph.D.  http://home.pacbell.net/hgladney

Message: 5
Date: Fri, 18 May 2007 16:58:25 +1200
From: Michael Dewsnip <mdewsnip@cs.waikato.ac.nz>
Subject: Re: [greenstone-users] FW: GLI create stalls and crashes
To: "H.M. Gladney" <hgladney@gmail.com>
Cc: greenstone-users@list.scms.waikato.ac.nz

Unfortunately, the simple fact is that the GLI cannot build medium to large collections. The overhead of Java and loading the document
metadata into memory means that there is often not enough memory left for building the collection. (And I think you might be right about a
memory leak or two).

The solution is to build the collection from the command line. This is not difficult: you only need to run two commands and rename a
directory. For more information about this, see the Greenstone Developer's Guide.
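
[Editorial sketch, not part of the original message.] The "two commands and rename a directory" procedure can be sketched roughly as follows. The script names import.pl and buildcol.pl and the collection path come from this thread; the helper functions run and build_collection, the DRY_RUN switch, and the removal of any old index directory are assumptions made for illustration. It also assumes the Greenstone environment is already set up in the current shell (on Linux, typically by sourcing setup.bash from the Greenstone directory).

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 it prints the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the command-line build.
# $1 = Greenstone home directory, $2 = collection name.
build_collection() {
    gsdlhome=$1
    collection=$2
    coll_dir="$gsdlhome/collect/$collection"

    # Command 1: convert the source documents into the archives directory.
    run import.pl "$collection" || return 1
    # Command 2: build the indexes into the building directory.
    run buildcol.pl "$collection" || return 1
    # Assumption: discard any index left over from a previous build.
    run rm -rf "$coll_dir/index" || return 1
    # The rename step: make the freshly built directory the live index.
    run mv "$coll_dir/building" "$coll_dir/index"
}
```

For the collection in this thread, a dry run such as `DRY_RUN=1; build_collection /home/gladney/gsdl snotest3` only prints the four commands, which is a cheap way to check the paths before starting a build that may run for hours.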


Michael
DL Consulting
Greenstone Digital Library and Digitisation Specialists

H.M. Gladney wrote:
> Conjecture: the problem, variously observed as a stall of the GLI application, a crash of the GLI application, or a freeze of the entire
> Linux system, might be the result of what some people call a "memory leak".  ...