Michael, thank you for your prompt
reply.
(1) I had run import.pl shortly before running buildcol.pl,
but do not recall whether or not I invoked GLI in between. I will repeat
this execution to confirm the finding. That will take about 10 hours of
execution time, but that is not inconvenient.
(2) I will then follow your suggestion regarding
alternative indexing engines.
Yes, I am interested in your projected work towards making
Greenstone "industrial strength", as it might make Greenstone usable by the
Computer History Museum--a prospect that I have lately begun to doubt. Am
I correct in inferring that your R&D effort makes problems such as the
current one of significant interest to you, rather than merely a distasteful
chore? I'm concerned because I would hate to impose unduly upon your (or
anyone's) goodwill, and because I anticipate the current Greenstone problem
will not be the last one I encounter.
Cheerio, Henry
Hi Henry,
(Just in case this isn't clear: I no longer work for
the NZDL Project on Greenstone.)
I'm sorry to hear that your collection
didn't build from the command-line. A couple of suggestions:
- It looks like you've run buildcol.pl without running import.pl first. This will take longer and probably use more memory. Try importing your collection first, then building it (a sketch of the command sequence follows this list).
- If this still doesn't work, try one of the alternative indexers to MGPP (a sketch of switching indexers also follows this list). See http://wiki.greenstone.org/wiki/index.php/Building_Greenstone_collections#What.27s_the_difference_between_MG.2C_MGPP.2C_Lucene.3F. Use MG if it does everything you need, otherwise try Lucene.
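In case it helps, here is a minimal sketch of that import-then-build sequence, assuming Greenstone 2 is installed under ~/gsdl and the collection is named snotest3 (name taken from the paths later in this thread); adjust paths and names to your setup:

    # Load the Greenstone environment (setup script name assumed for a bash shell)
    cd ~/gsdl
    source setup.bash

    # Convert the source documents into Greenstone's archive format first ...
    import.pl snotest3

    # ... then build the search indexes from those archives
    buildcol.pl snotest3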
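And a sketch of switching indexers: as far as I recall, in Greenstone 2 the indexer is chosen by the buildtype line in the collection's etc/collect.cfg (please check the exact field name against your version), for example:

    # Edit ~/gsdl/collect/snotest3/etc/collect.cfg and change the indexer line,
    # e.g. from "buildtype mgpp" to "buildtype mg" or "buildtype lucene",
    # then rebuild the collection:
    buildcol.pl snotest3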
You may be interested to know that we have started an R&D project here at DL Consulting to investigate the limits on building collections in Greenstone (not the GLI), and hopefully to raise them considerably. This is taking place over the next 12 months.
Regards,
Michael
H.M. Gladney wrote:
Ref: greenstone-users Digest, Vol 50, Issue 23, 2007-05-18 3:31 AM, message 5 (excerpt attached)

After delays for marginally related reasons, I again took up building a GSDL collection based on the Snobol collection held by the Computer History Museum. Following the advice in the attached note, I did so using the command-line procedures described in the Greenstone Developers' Guide, pp. 6-10.
Problem: the buildcol.pl execution fails; these are the last few lines of the console file:

    GAPlug: processing HASH098b.dir/doc.xml
    Converting doc.ps to TEXT format
    TEXTPlug: processing home/gladney/gsdl/collect/snotest3/tmp/doc.text
    PSPlug: extracting PostScript metadata from "/home/gladney/gsdl/collect/snotest3/archives/HASH098b.dir/doc.ps"
    RecPlug: getting directory /home/gladney/gsdl/collect/snotest3/archives/HASH1adf
    RecPlug: getting directory /home/gladney/gsdl/collect/snotest3/archives/HASH1adf/9193150c.dir
    GAPlug: processing HASH1adf/9193150c.dir/doc.xml
    numDocs: 12930
    numChunkDocs: 2788
    numDocsInChunk: 2790
    numFrags: 15916524
    numFragsInChunk: 3314387
    chunkStartFragNum: 12704414
    num: 14754
    [num].start: 10436949
    [num].here: 10436994
    [num+1].start: 10436972
    mgpp_passes : Bit buffer overrun
    gladney@HMG3:~/gsdl$

Reminder: the collection at hand (alluded to in earlier greenstone-users posts) consists of approx. 75,000 files organized in a directory tree with over 2,000 internal nodes. The space needed for this collection is approx. 3.7 GByte. This size is typical of other collections anticipated at the Computer History Museum (CHM). In other words, without success on this sample, the CHM SW Preservation group will have to abandon Greenstone in favor of some other digital library offering--sad, in view of my perception that the functionality and flexibility of Greenstone address many CHM needs!
Perhaps a collection of this size is beyond what Greenstone has commonly been applied to. If so, one might conjecture that it is likely to encounter and flush out GSDL bugs that prior applications have not exposed--particularly resource-constraint bugs such as memory leaks. Does the Greenstone team have an opinion about this?
Next steps? At the moment, the only possibility I see is to send the GSDL team a copy of the Snobol collection, so that the team can reproduce the same problem and debug it. Obviously that would have to be done by snail-mailing a DVD copy.
Is the team interested in our doing this? If so, what address should I
use?
Cheerio, Henry

H.M. Gladney, Ph.D.
http://home.pacbell.net/hgladney
=============
Message: 5
Date: Fri, 18 May 2007 16:58:25 +1200
From: Michael Dewsnip <mdewsnip@cs.waikato.ac.nz>
Subject: Re: [greenstone-users] FW: GLI create stalls and crashes
To: "H.M. Gladney" <hgladney@gmail.com>
Cc: greenstone-users@list.scms.waikato.ac.nz
Unfortunately, the simple fact is that the GLI cannot
build medium to large collections. The overhead of Java and loading the
document metadata into memory means that there is often not enough memory
left for building the collection. (And I think you might be right about
a memory leak or two).
The solution is to build the collection from the command line. This is not difficult: you only need to run two commands and rename a directory. For more information about this, see the Greenstone Developer's Guide.
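For reference, once import.pl and buildcol.pl have finished, the directory rename would look roughly like this; a sketch only, with the collection name snotest3 and the ~/gsdl install path assumed from the rest of this thread:

    # buildcol.pl writes the new indexes into the collection's "building"
    # directory; swap it in as the live "index" directory:
    cd ~/gsdl/collect/snotest3
    rm -rf index
    mv building index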
Regards,
Michael

DL Consulting
Greenstone Digital Library and Digitisation Specialists
contact@dlconsulting.com
www.dlconsulting.com
H.M. Gladney wrote:
> Conjecture: the problem, variously observed as a stall of the GLI application, a crash of the GLI application, or a freeze of the entire Linux system, might be the result of what some people call a "memory leak".
...
--
DL Consulting
Greenstone Digital Library and Digitisation Specialists
contact@dlconsulting.com
www.dlconsulting.com