[greenstone-users] Large dynamic collection in Greenstone 2? A few questions....

From Cathy Chang
Date Wed Dec 15 15:00:04 2010
Subject [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....
In-Reply-To (F7EA6322DEC6414CB811EFBABD42C216-DSC46)
Hi, Diego!

Your experience using Depositor is really helpful. I am working on configuring Depositor on Windows. After enabling Depositor, I chose a collection to deposit into, and the error message below appeared. Did I get a configuration step wrong? I cannot figure it out. Any advice on configuration? Thanks!

"Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, admin@example.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log."
Cheers,
Cathy


From: greenstone-users-bounces@list.scms.waikato.ac.nz [mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On Behalf Of Diego Spano
Sent: Wednesday, 15 December 2010 9:21 a.m.
To: sharynwise@optushome.com.au; greenstone-users@list.scms.waikato.ac.nz
Subject: RE: [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....

Sharyn, I will tell you my experience. GS is totally viable for that project.

"The project is ongoing, so it is a requirement that new documents/audio files can be added dynamically at any time. I was thinking of using the Depositor for that functionality and possibly scheduling full builds nightly? "

Sure. Depositor is a good tool to let users archive their files. You must configure Depositor not to build the collection online, so that you can set cron tasks to do the import/build at night.
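
As a rough illustration of the nightly job described above, the sketch below builds the two command lines a cron task might run. It assumes the Greenstone 2 scripts import.pl and buildcol.pl are on the PATH (for example after sourcing the Greenstone setup script) and that both accept an -incremental option; check the usage output of your own installation before relying on either. The collection name "demo" and the function names are hypothetical.

```python
import subprocess
from typing import List


def nightly_commands(collection: str, incremental: bool = True) -> List[List[str]]:
    """Return the import and build command lines for one collection."""
    # ASSUMPTION: "-incremental" is accepted by both scripts in your version.
    flags = ["-incremental"] if incremental else []
    return [
        ["perl", "-S", "import.pl", *flags, collection],
        ["perl", "-S", "buildcol.pl", *flags, collection],
    ]


def run_nightly(commands: List[List[str]], dry_run: bool = True) -> int:
    """Print (dry run) or execute each command in order; return how many were handled."""
    handled = 0
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True aborts the job if import fails, so the build
            # step never runs on top of a failed import.
            subprocess.run(cmd, check=True)
        handled += 1
    return handled


if __name__ == "__main__":
    # "demo" is a placeholder collection name.
    run_nightly(nightly_commands("demo"), dry_run=True)
```

A crontab entry such as `0 2 * * * python3 /path/to/nightly_build.py` (path hypothetical) would run it at 02:00 each night, with dry_run switched off once the printed commands look right.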


"One problem that occurs to me is that the even the minimal rebuild triggered by the Depositor might become very time-consuming as the collection grows (is there a way of calculating this?). Secondly, will Greenstone scale to handle large numbers of largish audio files (ie approx. an hour's recording per file, totalling possibly hundreds of hours) as well as hundreds (possibly thousands) of documents? Finally can the depositor interface be modified to allow javascript validation of input metadata and if so how?"

You must use Lucene as the indexer. Lucene supports incremental building, so only new documents are added to an existing index. As for large audio files: they are not a problem. Greenstone does not extract any text from audio files, so the time it takes to import one is the time the operating system takes to copy it from the import folder to the archive folder. These files are simply managed as file system objects; there is no conversion process. I think it takes more time to process a PDF (it is converted to HTML to extract the text) than an audio file. I have a collection with more than 13,000 PDFs, and every night a few more are added without problems. I also have another collection whose documents consist of TIFF images and text files from OCR. That collection now has more than 700,000 images with a full-text index.
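
If import time for audio is essentially copy time, a back-of-the-envelope estimate is just total bytes divided by disk throughput. The sketch below sums the audio files under a folder and applies that division. The extension list and the throughput figure are assumptions to replace with values measured on your own server; nothing here is part of Greenstone.

```python
import os

# ASSUMPTION: which extensions count as audio in your collection.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".ogg"}


def total_audio_bytes(folder: str) -> int:
    """Sum the sizes of all audio files found under folder."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if os.path.splitext(name)[1].lower() in AUDIO_EXTENSIONS:
                total += os.path.getsize(os.path.join(root, name))
    return total


def estimate_copy_seconds(total_bytes: int, bytes_per_second: float) -> float:
    """Estimate copy time, given a throughput measured on the target disks."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return total_bytes / bytes_per_second
```

For example, `estimate_copy_seconds(total_audio_bytes("import"), measured_rate)` gives a rough number of seconds for one pass over the import folder, where `measured_rate` comes from timing a real copy on your hardware.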

Some recommendations:

- Use Linux, not Windows. File system management is better.
- You can put the import, archive and index folders on different disks for better performance.
- You can also modify some of the GS processes to avoid copying the files from /archives to /index/assoc. You can create a link from index that points to /archives; with this modification you keep only one copy of the original file and also reduce processing time, because you don't have to copy those big files again.
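
The linking idea in the last point can be sketched as follows. This is only an illustration of replacing a copied file with a symbolic link; it does not modify any Greenstone process, and the function name and example paths are hypothetical. On Windows, creating symbolic links may require extra privileges.

```python
import os


def link_instead_of_copy(source: str, target: str) -> str:
    """Make target a symbolic link to source, replacing any existing file or link."""
    source = os.path.abspath(source)
    target = os.path.abspath(target)
    # Create the parent folder of the link if it does not exist yet.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # Remove a previous copy or stale link (regular files and links only,
    # not directories).
    if os.path.lexists(target):
        os.remove(target)
    os.symlink(source, target)
    return target
```

Calling `link_instead_of_copy("archives/talk.mp3", "index/assoc/talk.mp3")` (paths illustrative) leaves a single copy of the audio on disk while the index folder still resolves to it.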

Hope this helps.

Diego

Diego Spano
Prodigio Consultores
Capital Federal - Argentina
Tel: (54 11) 5093-5313
http://ar.linkedin.com/in/diegospano
http://www.prodigioconsultores.com

________________________________
From: greenstone-users-bounces@list.scms.waikato.ac.nz [mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On Behalf Of Sharyn Wise
Sent: Tuesday, 14 December 2010 15:30
To: greenstone-users@list.scms.waikato.ac.nz
Subject: [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....
Hi all

I'm looking at whether Greenstone 2 is a viable and scalable solution for building a large dynamic collection. The collection will be composed of data from a three-year research project, primarily documents (PDF, DOC, email text) and audio files.

The project is ongoing, so it is a requirement that new documents/audio files can be added dynamically at any time. I was thinking of using the Depositor for that functionality and possibly scheduling full builds nightly?

One problem that occurs to me is that even the minimal rebuild triggered by the Depositor might become very time-consuming as the collection grows (is there a way of calculating this?). Secondly, will Greenstone scale to handle large numbers of largish audio files (i.e. approx. an hour's recording per file, totalling possibly hundreds of hours) as well as hundreds (possibly thousands) of documents? Finally, can the Depositor interface be modified to allow JavaScript validation of input metadata, and if so, how?

I'd be very interested to hear the community's and Greenstone team's thoughts on these questions, any other potential problems you foresee, and any recommendations. Thanks in advance!
cheers
Sharyn