[greenstone-users] File size restrictions?

From Sean Mitchuson
Date Fri Jul 29 02:24:32 2011
Subject [greenstone-users] File size restrictions?
Thanks for that information. Sadly, the collection seems to be gone, so I
can't try this method. A new issue has come up, though. We are trying to
upload large PDF files (around 75 MB) into Greenstone, and either the
upload doesn't take or we get the following output when importing and building:

import.pl> perl -S incremental-import.pl -collectdir
"E:\Greenstone\collect" -gli -language "en" dullrich-yearbo40 2>&1
import.pl> *****
import.pl> First time import. Switching to full import.pl.
import.pl> *****
import.pl> UnknownPlugin Warning: Non-recursive plugin has no process_exp
import.pl> Removing current contents of the archives directory...
import.pl> Removing contents of the collection "tmp" directory...
import.pl> Global file scan checking directory:
E:\Greenstone\collect\dullrich-yearbo40\import
import.pl>
import.pl> MetadataXMLPlugin: processing metadata.xml
import.pl> Converting test2.pdf to HTML format
import.pl> Error executing pdftohtml.pl
import.pl> pdftohtml error log:
import.pl> Error: PDF version 1.6 -- xpdf supports version 1.4
(continuing anyway)
import.pl> Error (0): PDF file is damaged - attempting to reconstruct
xref table...
import.pl> Error: Couldn't find trailer dictionary
import.pl> Error: Couldn't read xref table
import.pl> Could not convert test2.pdf to HTML format
import.pl> Error: PDF version 1.6 -- xpdf supports version 1.4
(continuing anyway)
import.pl> Error (0): PDF file is damaged - attempting to reconstruct
xref table...
import.pl> Error: Couldn't find trailer dictionary
import.pl> Error: Couldn't read xref table
import.pl>
import.pl> WARNING: No plugin could process test2.pdf
import.pl>
import.pl> *********************************************
import.pl> Import complete
import.pl> *********************************************
import.pl> * 1 document was considered for processing
import.pl> * 0 were processed and included in the collection
import.pl> * 1 was rejected
import.pl> See E:\Greenstone\collect\dullrich-yearbo40\etc\fail.log
for a list of unrecognised and/or rejected documents
import.pl> Extracting new metadata from archive files.
import.pl> Archived metadata extraction complete.
buildcol.pl> perl -S full-buildcol.pl -collectdir
"E:\Greenstone\collect" -gli -language "en" dullrich-yearbo40 2>&1
buildcol.pl> UnknownPlugin Warning: Non-recursive plugin has no process_exp
buildcol.pl>
buildcol.pl> *** creating the compressed text
buildcol.pl>
buildcol.pl> collecting text statistics (mgpp_passes -T1)
buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
buildcol.pl> Stats (Compressing text from text)
buildcol.pl> Total bytes in collection: 0
buildcol.pl> Total bytes in text: 0
buildcol.pl> ***************
buildcol.pl> WARNING: There is very little or no text to compress
buildcol.pl> Was this your intention?
buildcol.pl> ***************
buildcol.pl>
buildcol.pl> creating the compression dictionary
buildcol.pl>
buildcol.pl> compressing the text (mgpp_passes -T2)
buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
buildcol.pl> Stats (Compressing text from text)
buildcol.pl> Total bytes in collection: 0
buildcol.pl> Total bytes in text: 0
buildcol.pl> ***************
buildcol.pl> WARNING: There is very little or no text to compress
buildcol.pl> Was this your intention?
buildcol.pl> ***************
buildcol.pl>
buildcol.pl> *** building index text;dc.Title,ex.Title;ex.Source; in
subdirectory idx
buildcol.pl>
buildcol.pl> creating index dictionary (mgpp_passes -I1)
buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
buildcol.pl> Stats (Creating index text;dc.Title,ex.Title;ex.Source;)
buildcol.pl> Total bytes in collection: 0
buildcol.pl> Total bytes in text;dc.Title,ex.Title;ex.Source;: 0
buildcol.pl> ***************
buildcol.pl> WARNING: There is very little or no text to process for
text;dc.Title,ex.Title;ex.Source;
buildcol.pl> Was this your intention?
buildcol.pl> ***************
buildcol.pl>
buildcol.pl> inverting the text (mgpp_passes -I2)
buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
buildcol.pl> Stats (Creating index text;dc.Title,ex.Title;ex.Source;)
buildcol.pl> Total bytes in collection: 0
buildcol.pl> Total bytes in text;dc.Title,ex.Title;ex.Source;: 0
buildcol.pl> ***************
buildcol.pl> WARNING: There is very little or no text to process for
text;dc.Title,ex.Title;ex.Source;
buildcol.pl> Was this your intention?
buildcol.pl> ***************
buildcol.pl>
buildcol.pl> create the weights file
buildcol.pl>
buildcol.pl> creating 'on-disk' stemmed dictionary
buildcol.pl>
buildcol.pl> creating stem indexes
buildcol.pl> BuildDir: E:/Greenstone/collect/dullrich-yearbo40/building
buildcol.pl>
buildcol.pl> *** creating the info database and processing associated files
buildcol.pl> WARNING: No plugin could recognise archiveinf-src.gdb
buildcol.pl> Use of uninitialized value in string eq at
E:Greenstone/perllib/mgppbuilder.pm line 624.
buildcol.pl> Warning: No metadata values assigned to dc.Title;ex.Title.
buildcol.pl> *** outputting information for classifier: CL1
buildcol.pl> Warning: No metadata values assigned to ex.Source.
buildcol.pl> *** outputting information for classifier: CL2
buildcol.pl> *** outputting information for classifier: oai
buildcol.pl>
buildcol.pl> *** creating auxiliary files
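
The "PDF file is damaged" and "Couldn't find trailer dictionary" messages in the log above are what one would expect if the file on the server is incomplete, although the log alone does not prove that. As a rough way to check, here is a minimal sketch that reads the PDF header version and looks for the "startxref" and "%%EOF" markers near the end of the file. The function name check_pdf and the 1024-byte window are illustrative assumptions, not part of Greenstone or xpdf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the PDF header version and whether the
# end-of-file markers are present. A file cut short during upload
# usually lacks them.
check_pdf() {
    local file="$1"
    local header version
    # The first 8 bytes of a PDF look like "%PDF-1.6".
    header=$(head -c 8 "$file" | tr -d '\0')
    case "$header" in
        %PDF-*) version="${header#"%PDF-"}" ;;
        *) echo "not-a-pdf"; return 1 ;;
    esac
    # A complete PDF ends with "startxref", a byte offset, and "%%EOF".
    # Looking only at the last 1024 bytes is a heuristic.
    if tail -c 1024 "$file" | grep -a -q 'startxref' &&
       tail -c 1024 "$file" | grep -a -q '%%EOF'; then
        echo "version=$version complete"
    else
        echo "version=$version truncated"
        return 2
    fi
}
```

Running check_pdf on the original file and on the copy in the collection's import folder, and comparing the two results and the file sizes, should show whether the upload cut the file short. The check says nothing about whether xpdf can handle a complete version 1.6 file.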

On Wed, Jul 27, 2011 at 9:46 PM, Greenstone Team <
greenstone_team@cs.waikato.ac.nz> wrote:

> Hi Sean,
>
> Sam here suspects that it could be due to too many file handles being held
> open. It may be a known issue with GLI that it can't handle collections that
> are too large.
> Can you try to access the server machine and rebuild the collection from
> the command line on there (see below for instructions)? Does that work?
>
> Instructions on rebuilding the collection from the command-line. Read the
> following through the end first, before trying it out.
>
> 1. Try to ssh into the remote machine where the GS server lives, or
> otherwise try to gain direct access to the machine.
>
> 2. Stop your Greenstone 2 web server.
>
> 3. Open up a terminal (like an x-term on Linux, DOS prompt on Windows) and
> cd into your Greenstone installation folder. Note that the ">" angle bracket
> represents a new line of your waiting command prompt (don't type it):
> > cd "C:\Program Files\Greenstone2"
>
> 4. Next, run the setup script to set up Greenstone's environment.
> On Windows:
> > setup.bat
>
> On Linux:
> > source setup.bash
>
> 5. First, decide on whether you want to try incremental building in an
> attempt to save some time, or whether you think your collection may have
> become corrupted and you require a proper rebuild. Your collection is very
> large, so time-saving measures are worth considering:
>
> (i) If you want to try incremental building, then after each ".pl" below,
> type the word "-incremental" (without quotes) before or after the word
> "-keepold" already in ALL the commands in step 6 below. Make sure to put a
> space or more between -incremental and -keepold.
> (ii) If you suspect your index folder is corrupted and incremental building
> can't fix the underlying problem, but you expect that your archives folder
> has survived intact, just leave the "-keepold" flag in (don't add in
> "-incremental"). No need to change any of the commands in step 6.
> (iii) If you think even your collection's archives folder and not just its
> index folder may have become corrupted, *replace* the "-keepold" flag in ALL
> the commands in step 6 below with "-removeold". (But again, don't add any
> "-incremental").
>
> I would go with (iii), but only AFTER moving your collection's current
> "index" and "archives" folders out of the way, to keep some sort of backup
> of them. (Your collection's "index" and "archives" folders are located in
> your GS2 installation's collect/<collection name> directory.) Moving them
> elsewhere will also tell if your OS is holding a lock on any index files,
> since Windows often does this and that can break the building process.
> Option (iii) may take the longest but at least you'd have tried it all in
> one go.
>
>
> 6. Now, you are ready to start the 3 step manual collection building
> process:
>
> a. IMPORTING
>
> On Windows:
> > perl -S import.pl -keepold <type your collection's name here>
>
> On Linux:
> > import.pl -keepold <type your collection's name here>
> (If that didn't work, plug the word "perl" in front of the Linux command).
>
> NOTE: If your collect folder is located elsewhere, add in the -collectdir
> flag to the command and provide the full path to your non-standard "collect"
> directory as follows:
> > perl -S import.pl -collectdir "full/path/to/your/external/collect"
> -keepold <type your collection's name here>
> Or on Linux:
> > import.pl -collectdir "full/path/to/your/external/collect" -keepold
> <type your collection's name here>
>
> It's likely the above will spend a long time trying to import your 14 GB
> worth of documents. Once that's done, the prompt will return to you, at
> which point you need to perform the next stage:
>
> b. BUILDING
>
> On Windows:
> > perl -S buildcol.pl -keepold <type your collection's name here>
>
> On Linux:
> > buildcol.pl -keepold <type your collection's name here>
> (If that didn't work, plug the word "perl" in front of the Linux command)
>
> Once again, if your collect directory is different from the standard GS2
> "collect" folder, additionally specify the -collectdir <"full path to your
> collect folder"> option to the buildcol command.
>
> It may take a very long time again to build your collection. But if it
> succeeds, you can move onto the 3rd stage of the rebuilding process:
>
> c. MOVING FOLDERS "BUILDING" TO "INDEX"
>
> Rebuilding manually from the command-line generates a folder called
> "building" inside your collect/<collection-name> folder. If you see any
> folder called "index" in there, then move it well out of the way (or delete
> it, if you feel confident). Then rename "building" to "index".
> While GLI does this step automatically for you, manual rebuilding does not.
>
>
> 7. If you saw no errors during any stage of the rebuilding process of step
> 6, it's a fair indication that things were okay. But to make fully sure,
> restart your GS2 web server and visit its home page and then go to your
> rebuilt collection and see if it still works.
>
>
> Write back if you encounter any error messages during step 6 or anything
> that goes visibly wrong in step 7 (or any of the steps).
>
> All the best,
> Anupama
>
>
> Sean Mitchuson wrote:
>
>> We have been working on a collection that is around 14 GB worth of data and
>> is mostly PDF files. Recently, after an upload session, we can no longer
>> access the collection. Every time we try to open it through the GLI it sits
>> and waits for minutes (up to 20 at last check) and then gives us a 500 error
>> for gliserver.pl.
>> Is this collection ruined? Or is there a way to save it?
>> Thanks,
>>
>> --
>> Sean Mitchuson
>> Library Tech Coordinator
>> Murray State University
>> Murray, Ky
>> Phone: 270.809.4773
>
>
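
For reference, steps 5 and 6 of the quoted instructions can be sketched as a single shell function for Linux. This is only an illustration under stated assumptions: it assumes "source setup.bash" has already been run in the same shell so that import.pl and buildcol.pl are on the PATH, and the names rebuild_collection, DRY_RUN, and the "index.bak.<timestamp>" backup folder are made up for this sketch. The flags are the ones named in the instructions.

```shell
#!/usr/bin/env bash
# Sketch of steps 5-6 (Linux). Assumes Greenstone's setup.bash has been
# sourced already. rebuild_collection and DRY_RUN are illustrative names.
rebuild_collection() {
    local collect_dir="$1" name="$2" mode="${3:-keepold}"
    local flags
    case "$mode" in
        incremental) flags="-incremental -keepold" ;;  # option (i)
        keepold)     flags="-keepold" ;;               # option (ii)
        removeold)   flags="-removeold" ;;             # option (iii)
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    local col="$collect_dir/$name"
    local stamp
    stamp=$(date +%Y%m%d%H%M%S)

    # With DRY_RUN=1 each command is printed instead of executed.
    run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

    # a. importing, then b. building ($flags is split into words on purpose)
    run import.pl -collectdir "$collect_dir" $flags "$name" || return 1
    run buildcol.pl -collectdir "$collect_dir" $flags "$name" || return 1

    # c. move any existing "index" aside, then rename "building" to "index"
    if [ -d "$col/index" ]; then
        run mv "$col/index" "$col/index.bak.$stamp" || return 1
    fi
    run mv "$col/building" "$col/index"
}
```

Setting DRY_RUN=1 before calling, for example, rebuild_collection "/full/path/to/collect" my-collection removeold prints the commands without running them, which is a cheap way to review them first. The sketch does not back up the "archives" folder, so for option (iii) that still has to be moved out of the way by hand beforehand.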


--
Sean Mitchuson
Library Tech Coordinator
Murray State University
Murray, Ky
Phone: 270.809.4773