RE: [greenstone-devel] Greenstone Processing 1 out of 761 documents (was RE: Missing HTML documents)

From Gregory S. Williamson
Date Tue, 16 Nov 2004 14:21:09 -0800
Subject RE: [greenstone-devel] Greenstone Processing 1 out of 761 documents (was RE: Missing HTML documents)

Thanks for the reply ... I'll try your suggestions this evening -- not currently close to that machine.

This collection is being built from the collector so that may be the problem.

I'll try the manual process tonight. This server is not on a network currently and so loading Java seemed a disproportionate amount of work since I knew I could process these files and didn't need the GLI. I'll arrange to download the most recent gsdl when I can -- probably this weekend.

When I say only one document, I mean that only one document is showing in the collection, and all of the index creation and text processing seems to have seen only the last document listed/processed by the HTMLPlug module.

The config file looks much like this:
indexes document:text document:Title document:Source document:Subject document:Author document:Period
defaultindex document:text

plugin ZIPPlug
plugin GAPlug
plugin TEXTPlug
plugin HTMLPlug -metadata_fields Subject,Title,Author,Period
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
plugin WordPlug
plugin PSPlug
plugin ArcPlug
plugin RecPlug

classify AZList -metadata Title
classify AZList -metadata Source
classify AZCompactList -metadata Subject -mingroup 1
classify AZCompactList -metadata Author -mingroup 1 -buttonname "Contributors"
classify AZCompactList -metadata Period -mingroup 1

and then the various format statements and then
collectionmeta collectionname "Shaping San Francisco Beta"
collectionmeta iconcollection "_httpprefix_/images/top_banner-2.gif"
collectionmeta collectionextra "\nLinux Beta test of the Shaping San Francisco project.\n_collectorextra_"
collectionmeta .document:text "text"
collectionmeta .document:Title "titles"
collectionmeta .document:Source "filenames"
collectionmeta .document:Subject "subjects"
collectionmeta .document:Author "authors"
collectionmeta .document:Period "period"

Apologies for not having included this in my earlier post.

Greg W.

-----Original Message-----
From: Katherine Don
Sent: Tue 11/16/2004 1:59 PM
To: Gregory S. Williamson
Subject: Re: [greenstone-devel] Greenstone Processing 1 out of 761 documents (was RE: Missing HTML documents)
hi greg

have you looked in the archives directory? there should be one directory per original file. (some may be subdirectories).

check some of the doc.xml files (in archives) - do they look sensible,
eg is there text inside the <Content> tags?
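
A rough sketch of that check in Python, for anyone who would rather not open each file by hand. It assumes only what is described above: one doc.xml per document somewhere under the archives directory, with the document text inside <Content> tags. The function names and the path passed on the command line are illustrative, not part of Greenstone.

```python
import os
import re
import sys

# Captures the body of each <Content>...</Content> element.
# A plain regex is used instead of an XML parser so that a malformed
# doc.xml is still inspected rather than rejected outright.
CONTENT_RE = re.compile(r"<Content>(.*?)</Content>", re.DOTALL)


def has_content_text(xml_text):
    """Return True if any <Content> element holds non-whitespace text."""
    return any(body.strip() for body in CONTENT_RE.findall(xml_text))


def check_archives(archives_dir):
    """Return (number of doc.xml files found, paths whose <Content> is empty)."""
    total = 0
    empty = []
    for root, _dirs, files in os.walk(archives_dir):
        if "doc.xml" in files:
            total += 1
            path = os.path.join(root, "doc.xml")
            with open(path, encoding="utf-8", errors="replace") as fh:
                if not has_content_text(fh.read()):
                    empty.append(path)
    return total, empty


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_archives.py /path/to/collection/archives
    found, empty_docs = check_archives(sys.argv[1])
    print(f"{found} doc.xml files found, {len(empty_docs)} with no text in <Content>")
    for p in empty_docs:
        print("  empty:", p)
```

If the count comes out close to the number of source files and few are empty, the import step is probably fine and the problem lies in the build.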

are you using command line building or GLI or collector? if you are
using the collector, try using command line building. if the archives
look fine, try running with verbosity 4. do you get 761
lines of "GAPlug processing HASHxxx/doc.xml" type entries?
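
A small sketch for counting those entries, assuming the verbose build output has been saved to a text file and that the lines look roughly like the "GAPlug processing HASHxxx/doc.xml" pattern quoted above. The exact wording of the log line may differ between Greenstone versions, so the pattern is an assumption to adjust; the function name is illustrative.

```python
import re

# Assumed shape of the log entry: "GAPlug processing <something>/doc.xml".
# Case-insensitive so that minor capitalisation differences still match.
GAPLUG_RE = re.compile(r"GAPlug\s+processing\s+(\S+doc\.xml)", re.IGNORECASE)


def gaplug_documents(log_text):
    """Return the doc.xml paths mentioned in GAPlug processing lines."""
    return GAPLUG_RE.findall(log_text)


sample = (
    "GAPlug processing HASH0123/doc.xml\n"
    "some unrelated line\n"
    "GAPlug processing HASH4567.dir/doc.xml\n"
)
docs = gaplug_documents(sample)
print(len(docs), "documents seen by GAPlug")
```

Comparing that count, and the number of distinct paths, against the 761 expected documents shows whether the build is reading every archive or the same one repeatedly.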

when you say only the last document is showing, what do you mean? a
titles list only has one document? a search only gives you one document?
what about a source classifier?

I would recommend upgrading to the latest version of Greenstone. And use the Librarian Interface for collection building, or command line
building. we don't support the collector any more.


PS, if you are having trouble with collection building, details about
how you have built the collection and the collection config file are
helpful to send with your questions.

Gregory S. Williamson wrote:
> Sorry for the repost, but this is a serious show stopper and I haven't heard anything since posting this 3 days ago ... any help would be welcome.
> Thanks
> -----Original Message-----
> From: Gregory S. Williamson
> Sent: Sat 11/13/2004 6:59 PM
> Subject: Missing HTML documents
> Dear long-suffering developers & list lurkers,
> I have recently built a collection from HTML documents (this is on Red Hat Linux using gsdl 2.40); the build summary shows 761 documents were considered for and processed into the collection.
> When I look in the build log I see all of the HTML files being copied and then processed (with an occasional complaint telling me that the language could not be extracted and it has assumed "en" as the encoding, which is fine).
> But only the last document processed by HTMLPlug seems to be showing; the subsequent steps all show only some 4700 words (and some warnings for not having any data or very little data).
> I don't see any warnings elsewhere save perhaps a clue in the apache warning log:
> [Sat Nov 13 17:49:21 2004] [error] [client 127.0.0.1] (70007) The timeout specified has expired: ap_content_length_filter: apr_bucket_read() failed,
> referrer:
> (the server is not connected to a network yet, so this is a hand transcription; the timestamp puts it about 2 minutes before the end of the build process, so after most of the copying is done, I think)
> Any suggestions as to what might be wrong? I sort of had my heart set on including the other 760 files in the same collection ;-}
> Thanks,
> Greg Williamson