[greenstone-users] Large dynamic collection in Greenstone 2? A few questions....

From Diego Spano
Date Fri Dec 17 07:03:59 2010
Subject [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....
In-Reply-To (6841B7909F46A041B5AA50684134353D0364F246D3F9-EXCHMBCLUS1-ds1-prod)
Cathy, it seems that you are trying to use Depositor while running the local
library server (server.exe). Am I right?

Depositor only works when you access your library through a web server such
as Apache. Try the Apache web server bundled with GS, and it will work.

Cheers
Diego

_____

From: Cathy Chang [mailto:Cathy.Chang@wintec.ac.nz]
Sent: Tuesday, 14 December 2010 23:00
To: 'dspano@anac.gov.ar'; 'sharynwise@optushome.com.au';
'greenstone-users@list.scms.waikato.ac.nz'
Subject: RE: [greenstone-users] Large dynamic collection in Greenstone 2?
A few questions....

Hi, Diego!

Your experience using Depositor is really helpful. I am working on
configuring Depositor on Windows. After enabling Depositor, I chose one
collection to deposit into, and the error message below appeared. Did I get
a configuration step wrong? I cannot figure it out. Any advice on
configuration? Thanks!

"Internal Server Error

The server encountered an internal error or misconfiguration and was unable
to complete your request.

Please contact the server administrator, admin@example.com and inform them
of the time the error occurred, and anything you might have done that may
have caused the error.

More information about this error may be available in the server error log."

Cheers,

Cathy

From: greenstone-users-bounces@list.scms.waikato.ac.nz
[mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On Behalf Of Diego
Spano
Sent: Wednesday, 15 December 2010 9:21 a.m.
To: sharynwise@optushome.com.au; greenstone-users@list.scms.waikato.ac.nz
Subject: RE: [greenstone-users] Large dynamic collection in Greenstone 2? A
few questions....

Sharyn, I will tell you about my experience. GS is totally viable for that
project.


"The project is ongoing, so it is a requirement that new documents/audio
files can be added dynamically at any time. I was thinking of using the
Depositor for that functionality and possibly scheduling full builds
nightly? "

Sure. Depositor is a good tool for letting users archive their files. You
must configure Depositor not to build the collection online, so that you can
set up cron tasks to do the import/build at night.
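
The nightly import/build described above could be driven by a small script
such as the one below. This is only a minimal sketch: it assumes the
Greenstone environment has already been set up in the shell that cron uses,
so that the standard import.pl and buildcol.pl scripts are on the PATH. The
function names and the collection name are hypothetical, and making the
freshly built index live is not shown.

```python
import subprocess
from typing import List


def nightly_build_commands(collection: str) -> List[List[str]]:
    """Return the commands for one offline import + build pass.

    Assumes import.pl and buildcol.pl are reachable on the PATH.
    """
    return [
        ["import.pl", collection],    # process new files from the import folder
        ["buildcol.pl", collection],  # build the indexes from the archives
    ]


def run_nightly_build(collection: str) -> None:
    """Run each step in order and stop at the first failure."""
    for cmd in nightly_build_commands(collection):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "mycollection" is a placeholder collection name.
    run_nightly_build("mycollection")
```

A crontab entry along the lines of "0 2 * * * python3 /path/to/nightly_build.py"
(the path is a placeholder) would then run it at 02:00 every night.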

"One problem that occurs to me is that even the minimal rebuild
triggered by the Depositor might become very time-consuming as the
collection grows (is there a way of calculating this?). Secondly, will
Greenstone scale to handle large numbers of largish audio files (ie approx.
an hour's recording per file, totalling possibly hundreds of hours) as well
as hundreds (possibly thousands) of documents? Finally can the depositor
interface be modified to allow javascript validation of input metadata and
if so how?"

You must use Lucene as the indexer. Lucene supports incremental building, so
only new documents are added to an existing index. As for large audio files,
they are not a problem. Greenstone does not extract any text from audio
files, so the time it takes to import one is just the time the operating
system takes to copy it from the import folder to the archives folder. Files
of this kind are simply handled as file system objects; there is no
conversion process. I think it takes longer to process a PDF (it is
converted to HTML to extract the text) than an audio file. I have a
collection with more than 13,000 PDFs, and every night a few are added
without problems. I also have another collection whose documents are made up
of TIFF images and text files from OCR. That collection now has more than
700,000 images with a full-text index.

Some recommendations:

- Use Linux, not Windows. File system management is better.

- You can put the import, archives and index folders on different disks for
better performance.

- You can also modify some of the GS processes to avoid copying the files
from /archives to /index/assoc. You can create a link in index that points
to /archives. With this modification you will have only one copy of the
original file, and you will also reduce processing time because you don't
have to copy those big files again.
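
The linking idea in the last recommendation can be illustrated as follows.
This is not Greenstone's own code, only a minimal sketch of replacing a file
copy with a symbolic link; the function name and the example paths are
hypothetical, and it assumes a file system that supports symbolic links
(such as on Linux).

```python
import os
from pathlib import Path


def link_assoc_file(archive_file: Path, assoc_file: Path) -> None:
    """Make assoc_file a symbolic link to archive_file instead of a copy."""
    assoc_file.parent.mkdir(parents=True, exist_ok=True)
    # Remove any earlier copy or stale link so the link can be recreated.
    if assoc_file.is_symlink() or assoc_file.exists():
        assoc_file.unlink()
    os.symlink(archive_file.resolve(), assoc_file)


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    link_assoc_file(
        Path("archives/doc1/recording.mp3"),
        Path("index/assoc/doc1/recording.mp3"),
    )
```

Only one copy of each large file then exists on disk. The web server must be
allowed to follow symbolic links for the linked files to be served.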

Hope this helps.

Diego

Diego Spano
Prodigio Consultores
Capital Federal - Argentina
Tel: (54 11) 5093-5313
http://ar.linkedin.com/in/diegospano
www.prodigioconsultores.com

_____

From: greenstone-users-bounces@list.scms.waikato.ac.nz
[mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On Behalf Of
Sharyn Wise
Sent: Tuesday, 14 December 2010 15:30
To: greenstone-users@list.scms.waikato.ac.nz
Subject: [greenstone-users] Large dynamic collection in Greenstone 2? A
few questions....

Hi all

I'm looking at whether Greenstone 2 is a viable and scalable solution for
building a large dynamic collection. The collection will be composed of data
from a three year research project - primarily documents (pdf, doc, email
text) and audio files.

The project is ongoing, so it is a requirement that new documents/audio
files can be added dynamically at any time. I was thinking of using the
Depositor for that functionality and possibly scheduling full builds
nightly?

One problem that occurs to me is that even the minimal rebuild triggered
by the Depositor might become very time-consuming as the collection grows
(is there a way of calculating this?). Secondly, will Greenstone scale to
handle large numbers of largish audio files (ie approx. an hour's recording
per file, totalling possibly hundreds of hours) as well as hundreds
(possibly thousands) of documents? Finally can the depositor interface be
modified to allow javascript validation of input metadata and if so how?

I'd be very interested to hear the community's and Greenstone team's
thoughts on these questions, any other potential problems you foresee, and
any recommendations. Thanks in advance!
cheers
Sharyn
