[greenstone-users] Large dynamic collection in Greenstone 2? A few questions....

From Diego Spano
Date Thu Dec 23 10:23:28 2010
Subject [greenstone-users] Large dynamic collection in Greenstone 2? A few questions....
In-Reply-To (AANLkTina1uW0HA-r5X8+WX2-vjVBqbfvLCVyOJVP5P7T-mail-gmail-com)
Sharyn,

with version 2.83, when you move from Windows to Linux you have to rebuild
the collection, because the internal database stores "/" or "\" depending on
where you built the index. But the new version (2.84) will fix that
problem, so you can copy from one to the other and the collection will be
ready for new documents to be added.
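
To illustrate the separator issue, here is a minimal Python sketch. It is not Greenstone code, and the stored path is a made-up example; it only shows why a path recorded with Windows separators is not usable as-is on Linux.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical path, as a build on Windows might record it.
stored_on_windows = r"archives\HASH0123.dir\doc.xml"

# A POSIX host treats backslashes as ordinary characters, so the whole
# string looks like one file name with no directory components.
as_seen_on_linux = PurePosixPath(stored_on_windows)

# Parsing with Windows rules first recovers the components, which can
# then be written out with forward slashes.
portable = PureWindowsPath(stored_on_windows).as_posix()
```

Here `portable` uses "/" throughout, while `as_seen_on_linux` has a single component.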

About the Depositor: if you take a look at the macro file deposit.dm, it has
"steps" you can define. Each step is associated with a script. I think you can
modify these scripts as you want.

Read this:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.8060&rep=rep1&type=pdf

It is a document explaining depositor and how it works.

Hope this helps.

Diego Spano

_____

From: sharyn.wise@gmail.com [mailto:sharyn.wise@gmail.com] On behalf of Sharyn
Wise
Sent: Wednesday, 15 December 2010 17:30
To: Cathy Chang
CC: dspano@anac.gov.ar; greenstone-users@list.scms.waikato.ac.nz
Subject: Re: [greenstone-users] Large dynamic collection in Greenstone 2? A
few questions....


Hi Diego and Cathy

Diego: Thanks so very much for sharing your valuable experience. Would it be
a viable option to build the DL on a Windows server and port it to Linux if
performance looks like becoming an issue?

Also, can anyone advise whether the Depositor page can be modified to include
client-side field validation?

Cathy: I had the same problem initially (using Greenstone 2.83). You don't
say which Greenstone or Windows versions you are using, but on the page
"2.83 release notes" in the wiki, under "known issues and patches" there is
an alternative library.cgi you can download to prevent this problem
happening under Windows XP (here's the link, but it may be scrubbed:
2.83-windows-library.cgi
<http://wiki.greenstone.org/wiki/gsdoc/patches/2.83-windows-library.cgi> ).
I'm actually running Windows 7, but this file fixed the problem anyway, so
it may be worth a try. Back up first and you've nothing to lose!

cheers
Sharyn


On Wed, Dec 15, 2010 at 12:59 PM, Cathy Chang <Cathy.Chang@wintec.ac.nz>
wrote:


Hi, Diego!

Your experience using Depositor is really helpful. I am working on
configuring Depositor on Windows. After enabling Depositor, I chose one
collection to deposit into, and the error message below came up. Did I get a
configuration step wrong? I cannot figure it out. Any advice on
configuration? Thanks!

"Internal Server Error

The server encountered an internal error or misconfiguration and was unable
to complete your request.

Please contact the server administrator, admin@example.com and inform them
of the time the error occurred, and anything you might have done that may
have caused the error.

More information about this error may be available in the server error log."

Cheers,

Cathy

From: greenstone-users-bounces@list.scms.waikato.ac.nz
[mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On Behalf Of Diego
Spano
Sent: Wednesday, 15 December 2010 9:21 a.m.
To: sharynwise@optushome.com.au; greenstone-users@list.scms.waikato.ac.nz
Subject: RE: [greenstone-users] Large dynamic collection in Greenstone 2? A
few questions....

Sharyn, I will tell you about my experience. GS is totally viable for that project.


"The project is ongoing, so it is a requirement that new documents/audio
files can be added dynamically at any time. I was thinking of using the
Depositor for that functionality and possibly scheduling full builds
nightly? "

Sure. Depositor is a good tool for letting users archive their files. You must
configure Depositor not to build the collection online, so that you can set up
cron tasks to do the import/build at night.
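
A minimal Python sketch of such a nightly job follows. It assumes the Greenstone environment has already been set up so that `import.pl` and `buildcol.pl` are on the PATH; the function names and the collection name are hypothetical, and any options your collection needs (for example, for incremental building) are left out.

```python
import subprocess


def nightly_build_commands(collection):
    """Return the two commands to run, in order, for one collection."""
    return [
        ["import.pl", collection],    # import folder -> archive folder
        ["buildcol.pl", collection],  # archive folder -> new indexes
    ]


def run_nightly_build(collection):
    # check=True stops at the first failing step, so a failed import
    # is never followed by a build.
    for command in nightly_build_commands(collection):
        subprocess.run(command, check=True)
```

A cron entry could then call a small script that runs `run_nightly_build("mycol")` at a quiet hour, where `mycol` stands in for your own collection name.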

"One problem that occurs to me is that even the minimal rebuild
triggered by the Depositor might become very time-consuming as the
collection grows (is there a way of calculating this?). Secondly, will
Greenstone scale to handle large numbers of largish audio files (ie approx.
an hour's recording per file, totalling possibly hundreds of hours) as well
as hundreds (possibly thousands) of documents? Finally can the depositor
interface be modified to allow javascript validation of input metadata and
if so how?"

You must use Lucene as the indexer. Lucene supports incremental
building, so only new documents are added to an existing index. About
large audio files: they are not a problem. Greenstone does not extract any
text from audio files, so the time it takes to import one is the time
the operating system takes to copy that file from the import folder to the
archive folder. Files of this kind are simply managed as file system objects;
there is no conversion process. I think it takes more time to process a PDF
(which is converted to HTML to extract the text) than an audio file. I have a
collection with more than 13,000 PDFs, and every night a few are added
without problems. I also have another collection with documents composed of
TIFF images and text files from OCR. That collection now has more than
700,000 images with a full-text index.

Some recommendations:

- Use Linux, not Windows. File system management is better.

- You can put the import, archive and index folders on different disks, so
you get better performance.

- You can also modify some of the GS processes to avoid copying the files
from /archives to /index/assoc. You can create a link from index that points
to /archives. With this modification you will have only one copy of the
original file, and you will also reduce processing time because you don't
have to copy those big files again.
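
The linking idea in the last recommendation can be sketched in Python as below. This is only an illustration on a throwaway directory: the folder layout, file names and the helper `link_instead_of_copy` are hypothetical, and it does not show which Greenstone process would need to be changed.

```python
import os
import tempfile


def link_instead_of_copy(archived_file, assoc_file):
    """Make assoc_file a symbolic link to archived_file instead of a copy."""
    os.makedirs(os.path.dirname(assoc_file), exist_ok=True)
    os.symlink(os.path.abspath(archived_file), assoc_file)


# Demonstration in a temporary directory with a made-up layout.
root = tempfile.mkdtemp()
archived = os.path.join(root, "archives", "HASH01.dir", "talk.mp3")
linked = os.path.join(root, "index", "assoc", "HASH01.dir", "talk.mp3")
os.makedirs(os.path.dirname(archived))
with open(archived, "wb") as handle:
    handle.write(b"audio bytes")

# Only one real copy of the file exists; the second path is a link to it.
link_instead_of_copy(archived, linked)
```

Because the second path is a link, the large file is stored once and nothing has to be copied again at build time.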

Hope this helps.

Diego

Diego Spano
Prodigio Consultores
Capital Federal - Argentina
Tel: (54 11) 5093-5313
http://ar.linkedin.com/in/diegospano
www.prodigioconsultores.com


_____


From: greenstone-users-bounces@list.scms.waikato.ac.nz
[mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On behalf of
Sharyn Wise
Sent: Tuesday, 14 December 2010 15:30
To: greenstone-users@list.scms.waikato.ac.nz
Subject: [greenstone-users] Large dynamic collection in Greenstone 2? A
few questions....

Hi all

I'm looking at whether Greenstone 2 is a viable and scalable solution for
building a large dynamic collection. The collection will be composed of
data from a three-year research project - primarily documents (pdf, doc,
email text) and audio files.

The project is ongoing, so it is a requirement that new documents/audio
files can be added dynamically at any time. I was thinking of using the
Depositor for that functionality and possibly scheduling full builds
nightly?

One problem that occurs to me is that even the minimal rebuild triggered
by the Depositor might become very time-consuming as the collection grows
(is there a way of calculating this?). Secondly, will Greenstone scale to
handle large numbers of largish audio files (ie approx. an hour's recording
per file, totalling possibly hundreds of hours) as well as hundreds
(possibly thousands) of documents? Finally can the depositor interface be
modified to allow javascript validation of input metadata and if so how?

I'd be very interested to hear the community's and Greenstone team's
thoughts on these questions, any other potential problems you foresee, and
any recommendations. Thanks in advance!
cheers
Sharyn


