[greenstone-users] Large dynamic collection in Greenstone 2? A few questions....

From Cathy Chang
Date Fri Dec 17 11:51:17 2010
Subject [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....
In-Reply-To (EAE4CF49706B4B4A81885A4F78B660CE-DSC46)
Thanks very much for your suggestions, Diego and Sharyn! I installed the Greenstone 2.83 local library on Windows on both my laptop and my workstation. Sharyn's suggestion works on my laptop, but not on my workstation. It is a little strange. I will try to figure it out.

Cheers,
Cathy

From: Diego Spano [mailto:dspano@anac.gov.ar]
Sent: Friday, 17 December 2010 7:01 a.m.
To: Cathy Chang; sharynwise@optushome.com.au; greenstone-users@list.scms.waikato.ac.nz
Subject: RE: [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....

Cathy, it seems that you are trying to use Depositor while running the local library (server.exe). Am I right?

Depositor only works when you access your library through a web server, such as Apache. Try the Apache server bundled with Greenstone, and it will work.

Cheers
Diego

________________________________
From: Cathy Chang [mailto:Cathy.Chang@wintec.ac.nz]
Sent: Tuesday, 14 December 2010 23:00
To: 'dspano@anac.gov.ar'; 'sharynwise@optushome.com.au'; 'greenstone-users@list.scms.waikato.ac.nz'
Subject: RE: [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....
Hi, Diego!

Your experience with Depositor is really helpful. I am working on configuring Depositor on Windows. After enabling Depositor, I chose one collection to deposit into, and the error message below came up. Did I get a configuration step wrong? I cannot figure it out. Any advice on the configuration? Thanks!

"Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, admin@example.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log."
Cheers,
Cathy


From: greenstone-users-bounces@list.scms.waikato.ac.nz [mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On Behalf Of Diego Spano
Sent: Wednesday, 15 December 2010 9:21 a.m.
To: sharynwise@optushome.com.au; greenstone-users@list.scms.waikato.ac.nz
Subject: RE: [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....

Sharyn, I will tell you about my experience. GS is totally viable for that project.

"The project is ongoing, so it is a requirement that new documents/audio files can be added dynamically at any time. I was thinking of using the Depositor for that functionality and possibly scheduling full builds nightly? "

Sure. Depositor is a good tool for letting users archive their files. You must configure Depositor not to build the collection online, so that you can set up cron tasks to run the import and build at night.
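
As a rough illustration of the nightly job described above, here is a minimal Python sketch that a cron task could call. It assumes the Greenstone environment has already been set up so that `import.pl` and `buildcol.pl` are on the PATH. The function names, the `extra_flags` parameter and the collection name `demo` are hypothetical; any flags you pass should be checked against the help output of the scripts in your own installation.

```python
import subprocess
from typing import List, Sequence


def nightly_commands(collection: str, extra_flags: Sequence[str] = ()) -> List[List[str]]:
    """Return the import and build command lines for one collection.

    Assumption: import.pl and buildcol.pl are on the PATH, so that
    `perl -S` can locate them. extra_flags is passed through unchanged.
    """
    return [
        ["perl", "-S", "import.pl", *extra_flags, collection],
        ["perl", "-S", "buildcol.pl", *extra_flags, collection],
    ]


def run_nightly(collection: str, extra_flags: Sequence[str] = ()) -> None:
    """Run import, then build, stopping at the first command that fails."""
    for cmd in nightly_commands(collection, extra_flags):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "demo" is a placeholder collection name.
    run_nightly("demo")
```

A crontab entry such as `0 2 * * * python3 nightly_build.py` (the file name is a placeholder) would run this at 02:00 every night. In Greenstone 2 the freshly built `building` directory normally has to replace `index` before the new documents become visible; this sketch leaves that step out.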


"One problem that occurs to me is that even the minimal rebuild triggered by the Depositor might become very time-consuming as the collection grows (is there a way of calculating this?). Secondly, will Greenstone scale to handle large numbers of largish audio files (ie approx. an hour's recording per file, totalling possibly hundreds of hours) as well as hundreds (possibly thousands) of documents? Finally can the depositor interface be modified to allow javascript validation of input metadata and if so how?"

You must use Lucene as the indexing engine. Lucene supports incremental building, so only new documents are added to an existing index. As for large audio files: they are not a problem. Greenstone does not extract any text from audio files, so the time it takes to import an audio file is the time the operating system takes to copy that file from the import folder to the archive folder. Files of this kind are simply handled as file system objects; there is no conversion process. I think it takes more time to process a PDF (which is converted to HTML to extract the text) than an audio file. I have a collection with more than 13,000 PDFs, and every night a few are added without problems. I also have another collection whose documents consist of TIFF images and text files from OCR. That collection now has more than 700,000 images with a full-text index.
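
To make the idea of incremental building concrete, here is a small Python sketch of the general principle: only files changed since the last build are picked up for processing. This is not how Greenstone or Lucene implement it internally; the function name and the use of file modification times are assumptions made purely for illustration.

```python
import os
from typing import List


def files_newer_than(folder: str, last_build: float) -> List[str]:
    """Return paths under folder modified after last_build (a Unix timestamp)."""
    newer = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            # Only files touched since the previous build need processing.
            if os.path.getmtime(path) > last_build:
                newer.append(path)
    return sorted(newer)
```

With a check like this, the cost of a nightly run depends on how many files arrived that day rather than on the total size of the collection.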

Some recommendations:

- Use Linux, not Windows. File system management is better.
- You can put the import, archive and index folders on different disks for better performance.
- You can also modify some of the GS processes to avoid copying the files from /archives to /index/assoc. You can create a link from index that points to /archives; with this modification you will have only one copy of the original file, and you will also reduce processing time because you don't have to copy those big files again.
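
The linking idea in the last recommendation can be sketched in Python as follows. This is only an illustration of replacing a copy with a symbolic link, not the actual modification made to the Greenstone scripts; the function name and the example paths are hypothetical, and symbolic links are assumed to be available (as they are on Linux).

```python
import os


def link_instead_of_copy(original: str, link_path: str) -> None:
    """Create a symbolic link at link_path that points to original."""
    parent = os.path.dirname(link_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Replace a stale link from a previous build; real files are left alone.
    if os.path.islink(link_path):
        os.remove(link_path)
    os.symlink(os.path.abspath(original), link_path)
```

For example, `link_instead_of_copy("archives/talk.mp3", "index/assoc/talk.mp3")` leaves a single copy of the audio file on disk, which is where the saving in space and processing time comes from.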

Hope this helps.

Diego

Diego Spano
Prodigio Consultores
Capital Federal - Argentina
Tel: (54 11) 5093-5313
http://ar.linkedin.com/in/diegospano
www.prodigioconsultores.com<http://www.prodigioconsultores.com>

________________________________
From: greenstone-users-bounces@list.scms.waikato.ac.nz [mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On Behalf Of Sharyn Wise
Sent: Tuesday, 14 December 2010 15:30
To: greenstone-users@list.scms.waikato.ac.nz
Subject: [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....
Hi all

I'm looking at whether Greenstone 2 is a viable and scalable solution for building a large dynamic collection. The collection will be composed of data from a three-year research project - primarily documents (pdf, doc, email text) and audio files.

The project is ongoing, so it is a requirement that new documents/audio files can be added dynamically at any time. I was thinking of using the Depositor for that functionality and possibly scheduling full builds nightly?

One problem that occurs to me is that even the minimal rebuild triggered by the Depositor might become very time-consuming as the collection grows (is there a way of calculating this?). Secondly, will Greenstone scale to handle large numbers of largish audio files (ie approx. an hour's recording per file, totalling possibly hundreds of hours) as well as hundreds (possibly thousands) of documents? Finally can the depositor interface be modified to allow javascript validation of input metadata and if so how?

I'd be very interested to hear the community's and Greenstone team's thoughts on these questions, any other potential problems you foresee, and any recommendations. Thanks in advance!
cheers
Sharyn