[greenstone-users] RE: Crash of buildcol.pl for largish collection

From H.M. Gladney
Date Mon, 4 Jun 2007 17:03:11 -0700
Subject [greenstone-users] RE: Crash of buildcol.pl for largish collection
In-Reply-To (46649F94-7000007-cs-waikato-ac-nz)
Michael, thank you for your prompt reply.
(1) I had run import.pl shortly before running buildcol.pl, but I do not recall whether I invoked GLI in between.  I will repeat the run to confirm the finding.  That will take about 10 hours of execution, which is not inconvenient.
(2) I will then follow your suggestion regarding alternative indexing engines.
Yes, I am interested in your projected work towards making Greenstone "industrial strength", as it might make Greenstone usable by the Computer History Museum--a prospect that I have lately begun to doubt.  Am I correct in inferring that your R&D effort makes problems such as the current one of significant interest to you, rather than merely a distasteful chore?  I'm concerned because I would hate to impose unduly upon your (or anyone's) goodwill, and because I anticipate the current Greenstone problem will not be the last one I encounter.
Cheerio, Henry

From: Michael Dewsnip [mailto:mdewsnip@cs.waikato.ac.nz]
Sent: Monday, June 04, 2007 4:26 PM
To: H.M. Gladney
Cc: greenstone-users@list.scms.waikato.ac.nz
Subject: Re: Crash of buildcol.pl for largish collection

Hi Henry,

(Just in case this isn't clear: I no longer work for the NZDL Project on Greenstone.)

I'm sorry to hear that your collection didn't build from the command-line. A couple of suggestions:

- It looks like you've run buildcol.pl without running import.pl first. This will take longer and will probably use more memory. Try importing your collection first, then building it.

- If this still doesn't work, try one of the alternative indexers to MGPP. See http://wiki.greenstone.org/wiki/index.php/Building_Greenstone_collections#What.27s_the_difference_between_MG.2C_MGPP.2C_Lucene.3F. Use MG if it does everything you need, otherwise try Lucene.

You may be interested to know that we have started an R&D project here at DL Consulting to investigate the limits on building collections in Greenstone (not the GLI), and hopefully raise these considerably. This is taking place over the next 12 months.



H.M. Gladney wrote:

Ref: greenstone-users Digest, Vol 50, Issue 23    2007-05-18 3:31 AM  message 5 (excerpt attached)
After delays for marginally-related reasons, I again took up building a GSDL collection based on the Snobol collection held by the Computer History Museum.  Following the advice in the attached note, I did so using the command-line procedures described in the Greenstone Developers' Guide pp.6-10.

Problem: the buildcol.pl execution fails with the following last few lines of the console file:
GAPlug: processing HASH098b.dir/doc.xml
doc.ps to TEXT format
TEXTPlug: processing home/gladney/gsdl/collect

PSPlug: extracting PostScript metadata from "/home/gladney/gsdl/collect/snotest3/archives/HASH098b.dir/doc.ps"
RecPlug: getting directory /home/gladney/gsdl/collect/snotest3/archives/HASH1adf
RecPlug: getting directory /home/gladney/gsdl/collect/snotest3/archives/HASH1adf/9193150c.dir
GAPlug: processing HASH1adf/9193150c.dir/doc.xml
numDocs: 12930
numChunkDocs: 2788
numDocsInChunk: 2790
numFrags: 15916524
numFragsInChunk: 3314387
chunkStartFragNum: 12704414
num: 14754
[num].start: 10436949
[num].here: 10436994
[num+1].start: 10436972
mgpp_passes : Bit buffer overrun
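
The counters printed just before the failure can be compared mechanically. The following is a minimal sketch in Python that parses the dump above and works out how far the `[num].here` value lies past `[num+1].start`. The reading of these fields as a write position and the start of the next entry is an assumption based only on their names, not on the mgpp_passes source, and the units are not stated in the log.

```python
import re

# The counter lines copied from the console output above.
DUMP = """\
numDocs: 12930
numChunkDocs: 2788
numDocsInChunk: 2790
numFrags: 15916524
numFragsInChunk: 3314387
chunkStartFragNum: 12704414
num: 14754
[num].start: 10436949
[num].here: 10436994
[num+1].start: 10436972
mgpp_passes : Bit buffer overrun
"""


def parse_dump(text):
    """Collect every 'name: integer' line into a dict; other lines are skipped."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s*(\d+)\s*$", line)
        if match:
            values[match.group(1).strip()] = int(match.group(2))
    return values


def overrun_summary(values):
    """Compare the position reached in entry [num] with the start of entry [num+1].

    Assumption: '[num].here' is the current position and '[num+1].start'
    is where the next entry begins.
    """
    start = values["[num].start"]
    here = values["[num].here"]
    next_start = values["[num+1].start"]
    return {
        "allotted": next_start - start,  # room between the two starts
        "used": here - start,            # distance actually covered
        "excess": here - next_start,     # positive means 'here' is past the next start
    }


if __name__ == "__main__":
    print(overrun_summary(parse_dump(DUMP)))
```

Under that reading, the position reached is past the start of the next entry, which is at least consistent with the "Bit buffer overrun" message. It does not by itself identify the cause.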

Reminder: the collection at hand (alluded to in earlier greenstone-users posts) consists of approx. 75000 files organized in a directory tree with over 2000 internal nodes.  The space needed for this collection is approx. 3.7Gbyte.  This size is typical of other collections anticipated in the Computer History Museum (CHM).  I.e., without success with this sample, the CHM SW Preservation group will have to abandon Greenstone in favor of some other digital library offering--sad, in view of my perception that the functionality and flexibility of Greenstone address many CHM needs!

Perhaps a collection of this size is beyond what Greenstone has often been applied to.  If so, one might conjecture that it is likely to encounter and flush out GSDL bugs beyond what prior applications have exposed--particularly resource constraint bugs such as memory leaks.  Does the Greenstone team have an opinion about this?

Next steps?  At the moment, the only possibility that I see is to send the GSDL team a copy of the Snobol collection, so that it can perhaps encounter the same problem, and debug it.  Obviously that would have to be by snail-mailing a DVD copy.  Is the team interested in our doing this?  If so, what address should I use?

Cheerio, Henry
H.M. Gladney, Ph.D.  http://home.pacbell.net/hgladney
Message: 5
Date: Fri, 18 May 2007 16:58:25 +1200
From: Michael Dewsnip <mdewsnip@cs.waikato.ac.nz>
Subject: Re: [greenstone-users] FW: GLI create stalls and crashes
To: "H.M. Gladney" <hgladney@gmail.com>
Cc: greenstone-users@list.scms.waikato.ac.nz

Unfortunately, the simple fact is that the GLI cannot build medium to large collections. The overhead of Java and loading the document
metadata into memory means that there is often not enough memory left for building the collection. (And I think you might be right about a
memory leak or two).

The solution is to build the collection from the command-line. This is not difficult: you only need to run two commands, and rename a
directory. For more information about this, see the Greenstone Developer's Guide.
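
As a rough illustration of "two commands, and rename a directory", here is a minimal Python sketch of the sequence. It assumes the Greenstone environment has already been set up so that import.pl and buildcol.pl are on the PATH, and that the directory to rename is `building`, which becomes `index` inside the collection directory. Those directory names are not stated in this thread and should be checked against the Developer's Guide. The function names are hypothetical, and the collection name `snotest3` in the usage note is taken from the paths in the console output.

```python
import shutil
import subprocess
from pathlib import Path


def plan_commands(collection):
    """Return the two Greenstone commands in the order they should run."""
    return [["import.pl", collection], ["buildcol.pl", collection]]


def promote_build(collection_dir):
    """Replace the live 'index' directory with the freshly built 'building' one.

    Assumption: build output lands in 'building' and the served copy is 'index'.
    """
    collection_dir = Path(collection_dir)
    building = collection_dir / "building"
    index = collection_dir / "index"
    if not building.is_dir():
        raise FileNotFoundError(f"no build output at {building}")
    if index.exists():
        shutil.rmtree(index)  # discard the previous index before renaming
    building.rename(index)
    return index


def rebuild(collection, collection_dir, dry_run=True):
    """Import, build, then rename; with dry_run=True only return the commands."""
    commands = plan_commands(collection)
    if dry_run:
        return commands
    for command in commands:
        subprocess.run(command, check=True)  # stop at the first failing step
    promote_build(collection_dir)
    return commands
```

Calling `rebuild("snotest3", "/home/gladney/gsdl/collect/snotest3")` with the default `dry_run=True` only returns the command list. Passing `dry_run=False` runs the import and build and then performs the rename, stopping early if either command exits with an error.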


Michael
DL Consulting
Greenstone Digital Library and Digitisation Specialists

H.M. Gladney wrote:
> Conjecture: the problem, variously observed as a stall of the GLI application, a crash of the GLI application, or a freeze of the entire
> Linux system, might be the result of what some people call a "memory leak".  ...
