From | Cathy Chang |
Date | Wed Dec 15 15:00:04 2010 |
Subject | [greenstone-users] Large dynamic collection in Greenstone 2? A few questions.... |
In-Reply-To | (F7EA6322DEC6414CB811EFBABD42C216-DSC46) |
Hi, Diego!
Your experience using Depositor is really helpful. I am working on configuring Depositor on Windows. After enabling Depositor, I chose a collection to deposit into, and the error message below appeared. Did I get a configuration step wrong? I cannot figure it out. Any advice on configuration? Thanks! "Internal Server Error
Sharyn, I will tell you my experience. GS is totally viable for that project.

"The project is ongoing, so it is a requirement that new documents/audio files can be added dynamically at any time. I was thinking of using the Depositor for that functionality and possibly scheduling full builds nightly?"

Sure. Depositor is a good tool for letting users archive their files. You must configure Depositor not to build the collection online, so that you can set cron tasks to do the import/build at night.
You must use Lucene as the indexer engine. Lucene allows incremental building, so only new documents are added to an existing index.

About large audio files: they are not a problem. Greenstone will not extract any text from audio files, so the time it takes to import an audio file is the time the operating system takes to copy that file from the import folder to the archives folder. Files of this kind are simply managed as file system objects; there is no conversion process. I think it takes more time to process a PDF (it is converted to HTML to extract the text) than an audio file.

I have a collection with more than 13,000 PDFs, and every night a few are added without problems. I also have another collection with documents composed of TIFF images and txt files from OCR. That collection now has more than 700,000 images with a full-text index.

Some recommendations:
- Use Linux, not Windows. File system management is better.
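The nightly import/build described above could be driven by a small script that cron calls. What follows is only a minimal sketch under stated assumptions: it assumes the Greenstone 2 scripts `import.pl` and `buildcol.pl` live under `bin/script` of the installation and accept an `-incremental` option, that the Greenstone environment has already been set up in the shell that runs it, and that the installation path and collection name (both hypothetical here) are replaced with real ones. Any step needed to activate the newly built index is left out.

```python
import subprocess


def nightly_commands(collection, gsdl_home="/usr/local/greenstone"):
    """Return the command lines for an incremental import and build.

    Assumptions (not confirmed by this thread): the scripts are found in
    <gsdl_home>/bin/script and both accept the -incremental option.
    The default gsdl_home is a hypothetical install path.
    """
    script_dir = f"{gsdl_home}/bin/script"
    return [
        ["perl", f"{script_dir}/import.pl", "-incremental", collection],
        ["perl", f"{script_dir}/buildcol.pl", "-incremental", collection],
    ]


def run_nightly(collection, gsdl_home="/usr/local/greenstone", dry_run=True):
    """Run (or, in a dry run, just print) the import and build steps in order."""
    for cmd in nightly_commands(collection, gsdl_home):
        if dry_run:
            # Show what would be executed without touching the collection.
            print(" ".join(cmd))
        else:
            # check=True stops the sequence if the import step fails,
            # so the build never runs on a half-imported collection.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "mycollection" is a placeholder collection name.
    run_nightly("mycollection", dry_run=True)
```

A crontab entry along the lines of `0 2 * * * python3 /path/to/nightly_build.py` (path hypothetical) would run it at 02:00 each night, once `dry_run` is switched off.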
Hope this helps. Diego Diego Spano
________________________________
I'm looking at whether Greenstone 2 is a viable and scalable solution for building a large dynamic collection. The collection will be composed of data from a three-year research project, primarily documents (pdf, doc, email text) and audio files. The project is ongoing, so it is a requirement that new documents/audio files can be added dynamically at any time. I was thinking of using the Depositor for that functionality and possibly scheduling full builds nightly? One problem that occurs to me is that even the minimal rebuild triggered by the Depositor might become very time-consuming as the collection grows (is there a way of calculating this?). Secondly, will Greenstone scale to handle large numbers of largish audio files (i.e. approx. an hour's recording per file, totalling possibly hundreds of hours) as well as hundreds (possibly thousands) of documents? Finally, can the Depositor interface be modified to allow JavaScript validation of input metadata, and if so, how? I'd be very interested to hear the community's and Greenstone team's thoughts on these questions, any other potential problems you foresee, and any recommendations. Thanks in advance!
|