[greenstone-devel] Greenstone Processing 1 out of 761 documents (was RE: Missing HTML documents)

From Gregory S. Williamson
Date Tue, 16 Nov 2004 12:25:00 -0800
Subject [greenstone-devel] Greenstone Processing 1 out of 761 documents (was RE: Missing HTML documents)
Sorry for the repost, but this is a serious showstopper and I haven't heard anything since posting this 3 days ago ... any help would be welcome.


-----Original Message-----
From: Gregory S. Williamson
Sent: Sat 11/13/2004 6:59 PM
Subject: Missing HTML documents
Dear long-suffering developers & list lurkers,

I have recently built a collection from HTML documents (this is on Red Hat Linux using gsdl 2.40); the build summary shows that 761 documents were considered for and processed into the collection.

When I look in the build log I see all of the HTML files being copied and then processed (with an occasional complaint telling me that the language could not be extracted and it has assumed "en" as the encoding, which is fine).

But only the last document processed by HTMLPlug seems to be showing; the subsequent steps after the import all show only some 4700 words (and some warnings about having no data or very little data).

I don't see any warnings elsewhere save perhaps a clue in the apache warning log:
[Sat Nov 13 17:49:21 2004] [error] [client 127.0.0.1] (70007) The timeout specified has expired: ap_content_length_filter: apr_bucket_read() failed

(the server is not connected to a network yet, so this is a hand transcription; the timestamp puts it about 2 minutes before the end of the build process, so after most of the copying is done, I think)

Any suggestions as to what might be wrong? I sort of had my heart set on including the other 760 files in the same collection ;-}


Greg Williamson