Michael, thank you for your prompt reply.
(1) I had run import.pl shortly before running buildcol.pl,
but do not recall whether or not I invoked GLI in between. I will repeat
this execution to confirm the finding. That will take about 10 hours of
execution, but is not inconvenient.
(2) I will then follow your suggestion regarding
alternative indexing engines.
Yes, I am interested in your projected work towards making
Greenstone "industrial strength", as it might make Greenstone usable by the
Computer History Museum--a prospect that I have lately begun to doubt. Am
I correct in inferring that your R&D effort makes problems such as the
current one of significant interest to you, rather than merely a distasteful
chore? I'm concerned because I would hate to impose unduly upon your (or
anyone's) goodwill, and because I anticipate the current Greenstone problem
will not be the last one I encounter.
(Just in case this isn't clear: I no longer work for
the NZDL Project on Greenstone.)
I'm sorry to hear that your collection
didn't build from the command-line. A couple of suggestions:
- It looks like you've run buildcol.pl without running import.pl first. This
will take longer and probably use more memory. Try importing your collection
first, then building it.
- If this still doesn't work, try one of the alternative
indexers to MGPP. See http://wiki.greenstone.org/wiki/index.php/Building_Greenstone_collections#What.27s_the_difference_between_MG.2C_MGPP.2C_Lucene.3F.
Use MG if it does everything you need, otherwise try Lucene.
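The import-then-build order suggested above can be sketched as a small Python wrapper. This is only an illustration, not part of Greenstone: the helper names `build_commands` and `run_build` and the collection name "snobol" are hypothetical, and the sketch assumes the Greenstone environment has already been set up so that import.pl and buildcol.pl are on the PATH.

```python
import subprocess
from typing import Callable, List


def build_commands(collection: str) -> List[List[str]]:
    """Return the two command lines in the order they should run.

    Assumption: both scripts take the collection name as their only
    required argument and are reachable through the PATH.
    """
    return [
        ["import.pl", collection],    # step 1: import the source documents
        ["buildcol.pl", collection],  # step 2: build the indexes from the import
    ]


def run_build(collection: str, runner: Callable = subprocess.run) -> None:
    """Run import, then build, stopping at the first command that fails.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit status, so buildcol.pl is
    never started after a failed import.
    """
    for cmd in build_commands(collection):
        runner(cmd, check=True)


# Example call (hypothetical collection name):
#     run_build("snobol")
```

Passing the runner as a parameter makes it possible to check the command order without actually starting a build that may take hours.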
You may be
interested to know that we have started an R&D project here at DL Consulting
to investigate the limits on building collections in Greenstone (not the GLI),
and hopefully raise these considerably. This is taking place over the next 12
H.M. Gladney wrote:
Ref: greenstone-users Digest, Vol 50, Issue
23 2007-05-18 3:31 AM message 5 (excerpt
After delays for
marginally-related reasons, I again took up building a GSDL collection based
on the Snobol collection held by the Computer History Museum. Following
the advice in the attached note, I did so using the command-line procedures
described in the Greenstone Developers' Guide pp.6-10.
Problem: the buildcol.pl execution fails with
the following last few lines of the console file:
GAPlug: processing HASH098b.dir/doc.xml
doc.ps to TEXT
TEXTPlug: processing home/gladney/gsdl/collect
PSPlug: extracting PostScript
Bit buffer overrun
Reminder: the collection at hand (alluded to in earlier
greenstone-users posts) consists of approx. 75000 files organized in a directory tree with over 2000 internal nodes. The space needed for this
collection is approx. 3.7Gbyte. This size is typical of other
collections anticipated in the Computer History Museum (CHM). I.e.,
without success with this sample, the CHM SW Preservation group will have to
abandon Greenstone in favor of some other digital library offering--sad, in
view of my perception that the functionality and flexibility of Greenstone
address many CHM needs!
Perhaps a collection of this size is beyond what Greenstone
has often been applied to. If so,
one might conjecture that it is likely to encounter and flush out GSDL bugs
beyond what prior applications have exposed--particularly resource constraint
bugs such as memory leaks. Does the Greenstone team have an opinion
Next steps? At the moment, the only
possibility that I see is to send the GSDL team a copy of the Snobol
collection, so that it can perhaps encounter the same problem, and debug
it. Obviously that would have to be by snail-mailing a DVD copy.
Is the team interested in our doing this? If so, what address should I
H.M. Gladney, Ph.D.
Message: 5
Date: Fri, 18 May 2007 16:58:25 +1200
From: Michael Dewsnip <email@example.com>
Subject: Re: [greenstone-users] FW: GLI create stalls and
To: "H.M. Gladney" <firstname.lastname@example.org>
Unfortunately, the simple fact is that the GLI cannot
build medium to large collections. The overhead of Java and loading the
metadata into memory means that there is often not enough memory
left for building the collection. (And I think you might be right about a
memory leak or two).
solution is to build the collection from the command-line. This is not difficult: you only need to run two commands, and rename a
more information about this, see the Greenstone Developer's
Michael
DL Consulting
Greenstone Digital Library and Digitisation Specialists
H.M. Gladney wrote:
> Conjecture: the problem, variously observed as a stall of the GLI
> application, a crash of the GLI application, or a freeze of the entire
> Linux system, might be the result of what some people call a "memory leak".