[greenstone-users] Re: File size restrictions?

From Greenstone Team
Date Fri Jul 29 19:42:55 2011
Subject [greenstone-users] Re: File size restrictions?
In-Reply-To (CAP7FF=cv9FJttKeZJwvwDLo2SN1150PqQZpMPomGocd-Yw-wFQ-mail-gmail-com)

Hi Sean,

I'm very sorry to hear about your missing collection. Do you have any
backups of it at all? Perhaps you can try some hard disk recovery
software that will retrieve your old collection?

On the latest error you are seeing:

> Error: PDF version 1.6 -- xpdf supports version 1.4
> (continuing anyway)

By default, Greenstone uses pdftohtml to convert PDFs to HTML. The
pdftohtml program relies on an older version of Xpdf for most of the
PDF-to-HTML conversion, and that version only supports PDF documents up
to and including version 1.4. The error message above indicates that
the PDF being processed is version 1.6, which is newer than this Xpdf
supports. pdftohtml continued anyway, and the conversion then failed.
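
If it helps to find out which PDFs are affected before importing, the
version is declared in the "%PDF-x.y" header at the start of each file.
Below is a minimal Python sketch (not part of Greenstone; the function
names and the folder path are made up for illustration) that reads that
header and lists the files declaring a version above 1.4. A PDF can also
override its header version inside the document itself, which this
sketch does not inspect, so treat the result as a hint only.

```python
import re
from pathlib import Path

# The header marker "%PDF-M.N" normally sits at the very start of a PDF file.
HEADER_PATTERN = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(data):
    """Return (major, minor) from the PDF header bytes, or None if absent."""
    match = HEADER_PATTERN.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdfs_newer_than(folder, max_supported=(1, 4)):
    """List (file name, version) for PDFs under folder above max_supported."""
    newer = []
    for path in sorted(Path(folder).rglob("*.pdf")):
        # Only the first bytes are needed, so large files are cheap to check.
        with open(path, "rb") as handle:
            version = pdf_header_version(handle.read(1024))
        if version is not None and version > max_supported:
            newer.append((path.name, version))
    return newer


if __name__ == "__main__":
    # Hypothetical path: point this at your own collection's import folder.
    for name, version in pdfs_newer_than("E:/Greenstone/collect/mycollection/import"):
        print(f"{name}: PDF {version[0]}.{version[1]}")
```

Any file this lists is a candidate for the alternative conversion route
described next.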

Fortunately, Greenstone 2.84 works with the PDFBox extension to allow
processing more recent versions of PDFs. Please consult the section on
the PDFBox extension at
http://wiki.greenstone.org/wiki/index.php/2.84_Release_Notes#Extensions
It contains details on how to set things up in Greenstone 2.84 to use
it. If you have further questions, do write back.

Best of luck,
Anupama


Sean Mitchuson wrote:
> Thanks for that information. Sadly, the collection seems to be gone,
> so I can't try this method. We've had a new issue arise, though. We
> are trying to upload large PDF files (around 75 MB) into Greenstone,
> and either the upload doesn't take or this happens when importing
> and building:
>
> import.pl> perl -S incremental-import.pl -collectdir "E:\Greenstone\collect" -gli -language "en" dullrich-yearbo40 2>&1
> import.pl> *****
> import.pl> First time import. Switching to full import.pl.
> import.pl> *****
> import.pl> UnknownPlugin Warning: Non-recursive plugin has no process_exp
> import.pl> Removing current contents of the archives directory...
> import.pl> Removing contents of the collection "tmp" directory...
> import.pl> Global file scan checking directory: E:\Greenstone\collect\dullrich-yearbo40\import
> import.pl>
> import.pl> MetadataXMLPlugin: processing metadata.xml
> import.pl> Converting test2.pdf to HTML format
> import.pl> Error executing pdftohtml.pl
> import.pl> pdftohtml error log:
> import.pl> Error: PDF version 1.6 -- xpdf supports version 1.4 (continuing anyway)
> import.pl> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> import.pl> Error: Couldn't find trailer dictionary
> import.pl> Error: Couldn't read xref table
> import.pl> Could not convert test2.pdf to HTML format
> import.pl> Error: PDF version 1.6 -- xpdf supports version 1.4 (continuing anyway)
> import.pl> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> import.pl> Error: Couldn't find trailer dictionary
> import.pl> Error: Couldn't read xref table
> import.pl>
> import.pl> WARNING: No plugin could process test2.pdf
> import.pl>
> import.pl> *********************************************
> import.pl> Import complete
> import.pl> *********************************************
> import.pl> * 1 document was considered for processing
> import.pl> * 0 were processed and included in the collection
> import.pl> * 1 was rejected
> import.pl> See E:\Greenstone\collect\dullrich-yearbo40\etc\fail.log for a list of unrecognised and/or rejected documents
> import.pl> Extracting new metadata from archive files.
> import.pl> Archived metadata extraction complete.
> buildcol.pl> perl -S full-buildcol.pl -collectdir "E:\Greenstone\collect" -gli -language "en" dullrich-yearbo40 2>&1
> buildcol.pl> UnknownPlugin Warning: Non-recursive plugin has no process_exp
> buildcol.pl>
> buildcol.pl> *** creating the compressed text
> buildcol.pl>
> buildcol.pl> collecting text statistics (mgpp_passes -T1)
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Stats (Compressing text from text)
> buildcol.pl> Total bytes in collection: 0
> buildcol.pl> Total bytes in text: 0
> buildcol.pl> ***************
> buildcol.pl> WARNING: There is very little or no text to compress
> buildcol.pl> Was this your intention?
> buildcol.pl> ***************
> buildcol.pl>
> buildcol.pl> creating the compression dictionary
> buildcol.pl>
> buildcol.pl> compressing the text (mgpp_passes -T2)
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Stats (Compressing text from text)
> buildcol.pl> Total bytes in collection: 0
> buildcol.pl> Total bytes in text: 0
> buildcol.pl> ***************
> buildcol.pl> WARNING: There is very little or no text to compress
> buildcol.pl> Was this your intention?
> buildcol.pl> ***************
> buildcol.pl>
> buildcol.pl> *** building index text;dc.Title,ex.Title;ex.Source; in subdirectory idx
> buildcol.pl>
> buildcol.pl> creating index dictionary (mgpp_passes -I1)
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Stats (Creating index text;dc.Title,ex.Title;ex.Source;)
> buildcol.pl> Total bytes in collection: 0
> buildcol.pl> Total bytes in text;dc.Title,ex.Title;ex.Source;: 0
> buildcol.pl> ***************
> buildcol.pl> WARNING: There is very little or no text to process for text;dc.Title,ex.Title;ex.Source;
> buildcol.pl> Was this your intention?
> buildcol.pl> ***************
> buildcol.pl>
> buildcol.pl> inverting the text (mgpp_passes -I2)
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Stats (Creating index text;dc.Title,ex.Title;ex.Source;)
> buildcol.pl> Total bytes in collection: 0
> buildcol.pl> Total bytes in text;dc.Title,ex.Title;ex.Source;: 0
> buildcol.pl> ***************
> buildcol.pl> WARNING: There is very little or no text to process for text;dc.Title,ex.Title;ex.Source;
> buildcol.pl> Was this your intention?
> buildcol.pl> ***************
> buildcol.pl>
> buildcol.pl> create the weights file
> buildcol.pl>
> buildcol.pl> creating 'on-disk' stemmed dictionary
> buildcol.pl>
> buildcol.pl> creating stem indexes
> buildcol.pl> BuildDir: E:/Greenstone/collect/dullrich-yearbo40/building
> buildcol.pl>
> buildcol.pl> *** creating the info database and processing associated files
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Use of uninitialized value in string eq at E:\Greenstone/perllib/mgppbuilder.pm line 624.
> buildcol.pl> Warning: No metadata values assigned to dc.Title;ex.Title.
> buildcol.pl> *** outputting information for classifier: CL1
> buildcol.pl> Warning: No metadata values assigned to ex.Source.
> buildcol.pl> *** outputting information for classifier: CL2
> buildcol.pl> *** outputting information for classifier: oai
> buildcol.pl>
> buildcol.pl> *** creating auxiliary files
>
> On Wed, Jul 27, 2011 at 9:46 PM, Greenstone Team
> <greenstone_team@cs.waikato.ac.nz> wrote:
>
> Hi Sean,
>
> Sam here suspects that it could be due to too many file handles
> being held open. It may be a known issue with GLI that it can't
> handle collections that are too large.
> Can you try to access the server machine and rebuild the
> collection from the command line on there (see below for
> instructions)? Does that work?
>
> Instructions on rebuilding the collection from the command line.
> Read the following through to the end first, before trying it out.
>
> 1. Try to ssh into the remote machine where the GS server lives,
> or otherwise try to gain direct access to the machine.
>
> 2. Stop your Greenstone 2 web server.
>
> 3. Open up a terminal (like an x-term on Linux, DOS prompt on
> Windows) and cd into your Greenstone installation folder. Note
> that the ">" angle bracket represents a new line of your waiting
> command prompt (don't type it):
> > cd "C:\Program Files\Greenstone2"
>
> 4. Next, run the setup script to set up Greenstone's environment.
> On Windows:
> > setup.bat
>
> On Linux:
> > source setup.bash
>
> 5. First, decide whether you want to try incremental building
> in an attempt to save some time, or whether you think your
> collection may have become corrupted and you require a proper
> rebuild. Your collection is very large, so time-saving measures
> are worth considering:
>
> (i) If you want to try incremental building, add the flag
> "-incremental" (without quotes) immediately before or after the
> "-keepold" flag that already follows each ".pl" script name in ALL
> the commands in step 6 below. Make sure to leave at least one space
> between -incremental and -keepold.
> (ii) If you suspect your index folder is corrupted and incremental
> building can't fix the fundamental flaws, but you hopefully
> anticipate that your archives folder may have survived intact,
> just leave the "-keepold" flag in (don't add in "-incremental").
> No need to change any of the commands in step 6.
> (iii) If you think even your collection's archives folder and not
> just its index folder may have become corrupted, *replace* the
> "-keepold" flag in ALL the commands in step 6 below with
> "-removeold". (But again, don't add any "-incremental").
>
> I would go with (iii), but only AFTER moving your collection's
> current "index" and "archives" folders out of the way, to keep
> some sort of backup of them. (Your collection's "index" and
> "archives" folders are located in your GS2 installation's
> collect/<collection name> directory). Moving them elsewhere will
> also tell if your OS is holding a lock on any index files, since
> Windows often does this and that can break the building process.
> Option (iii) may take the longest but at least you'd have tried it
> all in one go.
>
>
> 6. Now you are ready to start the 3-step manual collection
> building process:
>
> a. IMPORTING
>
> On Windows:
> > perl -S import.pl -keepold <type your collection's name here>
>
> On Linux:
> > import.pl -keepold <type your collection's name here>
> (If that didn't work, plug the word "perl" in front of the Linux
> command).
>
> NOTE: If your collect folder is located elsewhere, add in the
> -collectdir flag to the command and provide the full path to your
> non-standard "collect" directory as follows:
> > perl -S import.pl -collectdir "full/path/to/your/external/collect" -keepold <type your collection's name here>
> Or on Linux:
> > import.pl -collectdir "full/path/to/your/external/collect" -keepold <type your collection's name here>
>
> The above will likely take a long time to import your 14 GB of
> documents. Once it has finished, the prompt will return, and you
> can move on to the next stage:
>
> b. BUILDING
>
> On Windows:
> > perl -S buildcol.pl -keepold <type your collection's name here>
>
> On Linux:
> > buildcol.pl -keepold <type your collection's name here>
> (If that didn't work, plug the word "perl" in front of the Linux
> command)
>
> Once again, if your collect directory is different from the
> standard GS2 "collect" folder, additionally specify the
> -collectdir <"full path to your collect folder"> option to the
> buildcol command.
>
> Building your collection may again take a very long time. But
> if it succeeds, you can move on to the third stage of the
> rebuilding process:
>
> c. MOVING FOLDERS "BUILDING" TO "INDEX"
>
> Rebuilding manually from the command line generates a folder
> called "building" inside your collect/<collection-name> folder. If
> you see a folder called "index" in there, move it well out of
> the way (or delete it, if you feel confident). Then rename
> "building" to "index".
> While GLI does this step automatically for you, manual rebuilding
> does not.
>
>
> 7. If you saw no errors during any stage of the rebuilding process
> of step 6, it's a fair indication that things were okay. But to
> make fully sure, restart your GS2 web server and visit its home
> page and then go to your rebuilt collection and see if it still works.
>
>
> Write back if you encounter any error messages during step 6 or
> anything that goes visibly wrong in step 7 (or any of the steps).
>
> All the best,
> Anupama
>
>
> Sean Mitchuson wrote:
>
> We have been working on a collection that is around 14 GB of
> data, mostly PDF files. Recently, after an upload session, we
> can no longer access the collection. Every time we try to open
> it through the GLI, it sits and waits for minutes (up to 20 at
> last check) and then gives us a 500 error for gliserver.pl.
> Is this collection ruined? Or is there a way to save it?
> Thanks,
>
> --
> Sean Mitchuson
> Library Tech Coordinator
> Murray State University
> Murray, Ky
> Phone: 270.809.4773
>
> --
> Sean Mitchuson
> Library Tech Coordinator
> Murray State University
> Murray, Ky
> Phone: 270.809.4773
>