[greenstone-users] Re: File size restrictions?

From Greenstone Team
Date Mon Aug 1 13:56:46 2011
Subject [greenstone-users] Re: File size restrictions?
In-Reply-To (CAP7FF=eTVnK1RGVXDCS8JG1dcmnSenc8TGzC=955hYJ78SRukw-mail-gmail-com)

Hi Sean,

> Our Greenstone installation doesn't seem to have an ext folder. Is
> this something I can just create?

Do you have Greenstone 2.84 installed? You need 2.84, since the PDFBox
extension works with that version and not with earlier ones.

If you do have 2.84 installed, then go ahead and create a folder called
"ext" inside your Greenstone installation.

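As a minimal sketch (not part of Greenstone itself), the following Python
snippet creates the "ext" folder if it is missing. The function name
ensure_ext_folder is hypothetical, and the path you pass in is an assumption
about where your Greenstone 2.84 installation lives.

```python
from pathlib import Path


def ensure_ext_folder(greenstone_home):
    """Create the "ext" folder inside a Greenstone installation if it is missing.

    greenstone_home is the top-level installation folder; adjust it to
    wherever Greenstone 2.84 is installed on your machine (an assumption).
    """
    home = Path(greenstone_home)
    if not home.is_dir():
        raise FileNotFoundError(f"No such Greenstone folder: {home}")
    ext = home / "ext"
    # exist_ok=True makes this safe to run when the folder already exists.
    ext.mkdir(exist_ok=True)
    return ext
```

Calling ensure_ext_folder with your installation folder returns the path of
the "ext" folder, whether it was newly created or already present.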

If you don't have 2.84 yet, then the top of the page
http://www.greenstone.org/download links to both the 2.84 binary and its
release notes. Do read the patches section of the release notes at
http://wiki.greenstone.org/wiki/index.php/2.84_Release_Notes#Patches_to_2.84

You also have the option of trying out a work-in-progress version of the
upcoming 2.85, since it has improvements over 2.84. However, bear in
mind that it *is* a work in progress version. If you choose to try this
out anyway, the version generated overnight is available from
http://www.greenstone.org/caveat-emptor/
Go down to the section "Greenstone2 Binary Releases" and download the
binary for your OS. If the current binary for your OS is not listed or
doesn't work due to some glitch, you can try ones from the days
preceding it by clicking on one of the date links at the top of the page.

Regards,
Anupama


Sean Mitchuson wrote:
> Our Greenstone installation doesn't seem to have an ext folder. Is this
> something I can just create?
>
> On Fri, Jul 29, 2011 at 2:42 AM, Greenstone Team
> <greenstone_team@cs.waikato.ac.nz> wrote:
>
> Hi Sean,
>
> I'm very sorry to hear about your missing collection. Do you have
> any backups of it at all? Perhaps you can try some hard disk
> recovery software that will retrieve your old collection?
>
> On the latest error you are seeing:
>
>
> Error: PDF version 1.6 -- xpdf supports version 1.4
> (continuing anyway)
>
>
> By default, Greenstone uses pdftohtml to convert PDFs to HTML. The
> pdftohtml program works with an earlier version of "XPDF" that
> does most of the PDF to HTML conversion, but which only supports
> PDF documents up to and including version 1.4. The above error
> message appears to indicate that the PDF that was being processed
> is version 1.6 and is thus incompatible. It then tried to process
> it anyway and then failed in its attempt.
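To check in advance which PDFs are affected, one option is to read the version
declared in each file's header. This is a minimal sketch, not part of
Greenstone: the function names and the MAX_XPDF_VERSION constant are
hypothetical, and the 1.4 limit is taken from the error message above. Note
that a PDF's document catalog can override the header version, so treat the
result as a hint rather than a guarantee.

```python
import re

# Limit reported by the pdftohtml error message quoted above.
MAX_XPDF_VERSION = (1, 4)


def pdf_header_version(path):
    """Return the (major, minor) version declared in a PDF's header, or None."""
    with open(path, "rb") as f:
        head = f.read(1024)  # the "%PDF-x.y" marker sits at the start of the file
    match = re.search(rb"%PDF-(\d+)\.(\d+)", head)
    if match is None:
        return None  # no PDF header found in the first 1024 bytes
    return int(match.group(1)), int(match.group(2))


def needs_pdfbox(path):
    """True if the declared version is newer than MAX_XPDF_VERSION."""
    version = pdf_header_version(path)
    return version is not None and version > MAX_XPDF_VERSION
```

For a file whose header reads %PDF-1.6, needs_pdfbox returns True; for one
declaring 1.4 or earlier it returns False.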
>
> Fortunately, Greenstone 2.84 works with the PDFBox extension to
> allow processing more recent versions of PDFs. Please consult the
> section on the PDFBox extension at
> http://wiki.greenstone.org/wiki/index.php/2.84_Release_Notes#Extensions
> It contains details on how to set things up in Greenstone 2.84 to
> use it. If you have further questions, do write back.
>
> Best of luck,
> Anupama
>
>
> Sean Mitchuson wrote:
>
> Thanks for that information. Sadly, the collection seems to be
> gone, so I can't try this method. We've had a new issue arise,
> though. We are trying to upload large PDF files (around 75 MB)
> into Greenstone, and either the upload doesn't take or this
> happens when building and importing:
>
>
> import.pl> perl -S incremental-import.pl -collectdir "E:\Greenstone\collect" -gli -language "en" dullrich-yearbo40 2>&1
> import.pl> *****
> import.pl> First time import. Switching to full import.pl.
> import.pl> *****
> import.pl> UnknownPlugin Warning: Non-recursive plugin has no process_exp
> import.pl> Removing current contents of the archives directory...
> import.pl> Removing contents of the collection "tmp" directory...
> import.pl> Global file scan checking directory: E:\Greenstone\collect\dullrich-yearbo40\import
> import.pl>
> import.pl> MetadataXMLPlugin: processing metadata.xml
> import.pl> Converting test2.pdf to HTML format
> import.pl> Error executing pdftohtml.pl
> import.pl> pdftohtml error log:
> import.pl> Error: PDF version 1.6 -- xpdf supports version 1.4 (continuing anyway)
> import.pl> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> import.pl> Error: Couldn't find trailer dictionary
> import.pl> Error: Couldn't read xref table
> import.pl> Could not convert test2.pdf to HTML format
> import.pl> Error: PDF version 1.6 -- xpdf supports version 1.4 (continuing anyway)
> import.pl> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> import.pl> Error: Couldn't find trailer dictionary
> import.pl> Error: Couldn't read xref table
> import.pl>
> import.pl> WARNING: No plugin could process test2.pdf
> import.pl>
> import.pl> *********************************************
> import.pl> Import complete
> import.pl> *********************************************
> import.pl> * 1 document was considered for processing
> import.pl> * 0 were processed and included in the collection
> import.pl> * 1 was rejected
> import.pl> See E:\Greenstone\collect\dullrich-yearbo40\etc\fail.log for a list of unrecognised and/or rejected documents
> import.pl> Extracting new metadata from archive files.
> import.pl> Archived metadata extraction complete.
> buildcol.pl> perl -S full-buildcol.pl -collectdir "E:\Greenstone\collect" -gli -language "en" dullrich-yearbo40 2>&1
> buildcol.pl> UnknownPlugin Warning: Non-recursive plugin has no process_exp
> buildcol.pl>
> buildcol.pl> *** creating the compressed text
> buildcol.pl>
> buildcol.pl> collecting text statistics (mgpp_passes -T1)
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Stats (Compressing text from text)
> buildcol.pl> Total bytes in collection: 0
> buildcol.pl> Total bytes in text: 0
> buildcol.pl> ***************
> buildcol.pl> WARNING: There is very little or no text to compress
> buildcol.pl> Was this your intention?
> buildcol.pl> ***************
> buildcol.pl>
> buildcol.pl> creating the compression dictionary
> buildcol.pl>
> buildcol.pl> compressing the text (mgpp_passes -T2)
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Stats (Compressing text from text)
> buildcol.pl> Total bytes in collection: 0
> buildcol.pl> Total bytes in text: 0
> buildcol.pl> ***************
> buildcol.pl> WARNING: There is very little or no text to compress
> buildcol.pl> Was this your intention?
> buildcol.pl> ***************
> buildcol.pl>
> buildcol.pl> *** building index text;dc.Title,ex.Title;ex.Source; in subdirectory idx
> buildcol.pl>
> buildcol.pl> creating index dictionary (mgpp_passes -I1)
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Stats (Creating index text;dc.Title,ex.Title;ex.Source;)
> buildcol.pl> Total bytes in collection: 0
> buildcol.pl> Total bytes in text;dc.Title,ex.Title;ex.Source;: 0
> buildcol.pl> ***************
> buildcol.pl> WARNING: There is very little or no text to process for text;dc.Title,ex.Title;ex.Source;
> buildcol.pl> Was this your intention?
> buildcol.pl> ***************
> buildcol.pl>
> buildcol.pl> inverting the text (mgpp_passes -I2)
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Stats (Creating index text;dc.Title,ex.Title;ex.Source;)
> buildcol.pl> Total bytes in collection: 0
> buildcol.pl> Total bytes in text;dc.Title,ex.Title;ex.Source;: 0
> buildcol.pl> ***************
> buildcol.pl> WARNING: There is very little or no text to process for text;dc.Title,ex.Title;ex.Source;
> buildcol.pl> Was this your intention?
> buildcol.pl> ***************
> buildcol.pl>
> buildcol.pl> create the weights file
> buildcol.pl>
> buildcol.pl> creating 'on-disk' stemmed dictionary
> buildcol.pl>
> buildcol.pl> creating stem indexes
> buildcol.pl> BuildDir: E:/Greenstone/collect/dullrich-yearbo40/building
> buildcol.pl>
> buildcol.pl> *** creating the info database and processing associated files
> buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
> buildcol.pl> Use of uninitialized value in string eq at E:\Greenstone/perllib/mgppbuilder.pm line 624.
> buildcol.pl> Warning: No metadata values assigned to dc.Title;ex.Title.
> buildcol.pl> *** outputting information for classifier: CL1
> buildcol.pl> Warning: No metadata values assigned to ex.Source.
> buildcol.pl> *** outputting information for classifier: CL2
> buildcol.pl> *** outputting information for classifier: oai
> buildcol.pl>
> buildcol.pl> *** creating auxiliary files
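The build above carries on even though the import rejected the only document,
so it can help to check the import summary before building. The following is
a rough sketch that pulls the three counts out of saved import output. The
function name import_summary is hypothetical, the patterns are modelled only
on the summary lines in this log (the plural wording is a guess), and each
summary line is assumed to sit on a single line of text.

```python
import re

# Patterns modelled on the "Import complete" summary lines in the log above.
# The plural forms ("documents were") are an assumption, not seen in this log.
SUMMARY_PATTERNS = {
    "considered": re.compile(
        r"\*\s+(\d+) documents? (?:was|were) considered for processing"),
    "processed": re.compile(
        r"\*\s+(\d+) (?:was|were) processed and included"),
    "rejected": re.compile(
        r"\*\s+(\d+) (?:was|were) rejected"),
}


def import_summary(log_text):
    """Return the counts found in the import summary as a dict of ints."""
    counts = {}
    for key, pattern in SUMMARY_PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            counts[key] = int(match.group(1))
    return counts
```

For the log above this gives 1 considered, 0 processed and 1 rejected, which
is the signal that the build step has nothing to work with.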
>
>
> On Wed, Jul 27, 2011 at 9:46 PM, Greenstone Team
> <greenstone_team@cs.waikato.ac.nz> wrote:
>
> Hi Sean,
>
> Sam here suspects that it could be due to too many file handles
> being held open. It may be a known issue with GLI that it can't
> handle collections that are too large.
> Can you try to access the server machine and rebuild the
> collection from the command line on there (see below for
> instructions)? Does that work?
>
> Instructions on rebuilding the collection from the
> command-line.
> Read the following through the end first, before trying it out.
>
> 1. Try to ssh into the remote machine where the GS server lives,
> or otherwise try to gain direct access to the machine.
>
> 2. Stop your Greenstone 2 web server.
>
> 3. Open up a terminal (like an x-term on Linux, DOS prompt on
> Windows) and cd into your Greenstone installation folder. Note
> that the ">" angle bracket marks the start of a new line at your
> waiting command prompt (don't type it):
> > cd "C:\Program Files\Greenstone2"
>
> 4. Next, run the setup script to set up Greenstone's environment.
> On Windows:
> > setup.bat
>
> On Linux:
> > source setup.bash
>
> 5. First, decide whether you want to try incremental building
> in an attempt to save some time, or whether you think your
> collection may have become corrupted and you require a proper
> rebuild. Your collection is very large, so time-saving measures
> are something to consider:
>
> (i) If you want to try incremental building, then in ALL the
> commands in step 6 below, type the word "-incremental" (without
> quotes) after each ".pl", either before or after the word
> "-keepold" that is already there. Make sure to put at least one
> space between -incremental and -keepold.
> (ii) If you suspect your index folder is corrupted and incremental
> building can't fix the fundamental flaws, but you expect that
> your archives folder has survived intact, just leave the
> "-keepold" flag in (don't add "-incremental"). There is no need
> to change any of the commands in step 6.
> (iii) If you think even your collection's archives folder, and not
> just its index folder, may have become corrupted, *replace* the
> "-keepold" flag in ALL the commands in step 6 below with
> "-removeold". (But again, don't add "-incremental".)
>
> I would go with (iii), but only AFTER moving your collection's
> current "index" and "archives" folders out of the way, to keep
> some sort of backup of them. (Your collection's "index" and
> "archives" folders are located in your GS2 installation's
> collect/<collection name> directory.) Moving them elsewhere will
> also tell you whether your OS is holding a lock on any index files:
> Windows often does this, and that can break the building process.
> Option (iii) may take the longest, but at least you'd have tried
> it all in one go.
>
>
> 6. Now you are ready to start the 3-step manual collection
> building process:
>
> a. IMPORTING
>
> On Windows:
> > perl -S import.pl -keepold <type your collection's name here>
>
> On Linux:
> > import.pl -keepold <type your collection's name here>
> (If that didn't work, put the word "perl" in front of the Linux
> command.)
>
> NOTE: If your collect folder is located elsewhere, add the
> -collectdir flag to the command and provide the full path to your
> non-standard "collect" directory as follows:
> > perl -S import.pl -collectdir "full/path/to/your/external/collect" -keepold <type your collection's name here>
> Or on Linux:
> > import.pl -collectdir "full/path/to/your/external/collect" -keepold <type your collection's name here>
>
> It's likely the above will spend a long time trying to import your
> 14 GB worth of documents. Once that's finally done, the prompt
> will return to you, at which point you need to perform the next
> stage:
>
> b. BUILDING
>
> On Windows:
> > perl -S buildcol.pl -keepold <type your collection's name here>
>
> On Linux:
> > buildcol.pl -keepold <type your collection's name here>
> (If that didn't work, put the word "perl" in front of the Linux
> command.)
>
> Once again, if your collect directory is different from the
> standard GS2 "collect" folder, additionally specify the
> -collectdir <"full path to your collect folder"> option to the
> buildcol command.
>
> It may again take a very long time to build your collection. But
> if it succeeds, you can move on to the 3rd stage of the rebuilding
> process:
>
> c. MOVING FOLDERS "BUILDING" TO "INDEX"
>
> Rebuilding manually from the command-line generates a folder
> called "building" inside your collect/<collection-name> folder. If
> you see any folder called "index" in here, then move it far out of
> the way (or delete it, if you feel confident). Then rename
> "building" to "index".
> While GLI does this step automatically for you, manual rebuilding
> does not.
>
>
> 7. If you saw no errors during any stage of the rebuilding process
> in step 6, it's a fair indication that things were okay. But to
> make fully sure, restart your GS2 web server, visit its home
> page, and then go to your rebuilt collection to see if it
> still works.
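The flag choices in step 5 and the commands in steps 6a and 6b can be
summarised in a small sketch. This is only an illustration, not a Greenstone
tool: the function name rebuild_commands and the mode names are hypothetical,
while the flags themselves (-keepold, -incremental, -removeold, -collectdir)
are the ones described in the steps above. It only builds the argument lists
and does not run anything.

```python
def rebuild_commands(collection, mode="keepold", collectdir=None, windows=False):
    """Return [import_command, build_command], each a list of arguments.

    mode selects one of the options from step 5:
      "incremental" -> option (i), "keepold" -> option (ii),
      "removeold"   -> option (iii).
    """
    flags = {
        "incremental": ["-incremental", "-keepold"],
        "keepold": ["-keepold"],
        "removeold": ["-removeold"],
    }
    if mode not in flags:
        raise ValueError(f"Unknown mode: {mode!r}")
    # On Windows the scripts are run through "perl -S", as in step 6.
    prefix = ["perl", "-S"] if windows else []
    # -collectdir is only needed for a non-standard collect folder.
    extra = ["-collectdir", collectdir] if collectdir else []
    return [
        prefix + [script] + extra + flags[mode] + [collection]
        for script in ("import.pl", "buildcol.pl")
    ]
```

The resulting lists could be passed to subprocess.run one after the other,
but only from a shell in which the setup script from step 4 has already been
run, and the build command should be started only after the import has
finished without errors.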
>
>
> Write back if you encounter any error messages during step 6 or
> anything that goes visibly wrong in step 7 (or any of the steps).
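Step 6c (moving any old "index" folder out of the way and renaming "building"
to "index") could be scripted along the following lines. This is a sketch
under assumptions: the function name promote_building_to_index, the separate
backup folder, and the timestamped backup name are all hypothetical choices,
not something Greenstone provides. Stop the web server first, as in step 2,
so that nothing holds a lock on the index files.

```python
import shutil
import time
from pathlib import Path


def promote_building_to_index(collection_dir, backup_dir):
    """Move any existing "index" folder into backup_dir, then rename
    "building" to "index". Returns the backup path, or None if there
    was no old index to move."""
    col = Path(collection_dir)
    building = col / "building"
    index = col / "index"
    if not building.is_dir():
        raise FileNotFoundError(f"No 'building' folder in {col}")
    backup = None
    if index.exists():
        backups = Path(backup_dir)
        backups.mkdir(parents=True, exist_ok=True)
        # Timestamped name so repeated runs do not overwrite older backups.
        backup = backups / f"index-{time.strftime('%Y%m%d-%H%M%S')}"
        shutil.move(str(index), str(backup))
    building.rename(index)
    return backup
```

Keeping the old index as a backup rather than deleting it follows the more
cautious of the two choices given in step 6c.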
>
> All the best,
> Anupama
>
>
> Sean Mitchuson wrote:
>
> We have been working on a collection that is around 14 GB worth
> of data and is mostly PDF files. Recently, after an upload
> session, we can no longer access the collection. Every time we
> try to open it through the GLI it sits and waits for minutes
> (up to 20 at last check) and then gives us a 500 error for
> gliserver.pl
>
> Is this collection ruined? Or is there a way to save it? Thanks,
>
> -- Sean Mitchuson
> Library Tech Coordinator
> Murray State University
> Murray, Ky
> Phone: 270.809.4773