Re: [greenstone-users] Migration of Existing Digital Library to Greenstone

From Stefan Boddie
Date Tue, 31 Oct 2006 10:47:40 +1300
Subject Re: [greenstone-users] Migration of Existing Digital Library to Greenstone
In-Reply-To (45462B6D-9020709-qeh-ox-ac-uk)
Hi Mike,
At Forced Migration Online we are about to make a funding application to expand our Forced Migration Online Digital Library. This currently consists of 9715 items (approx. 220,000 pages in total). These items comprise:

a) 4066 full text documents
b) 5649 full text articles from 5 Journals

Searching is possible in both the full text of the items and their metadata. The source material for the current library was a mixture of born-digital PDFs and scanned + OCR'd paper originals.

The Digital Library is currently driven by a proprietary application from a company called Olive Software. To enhance / expand our Library for the long term (10 years +) we first want to migrate it to an Open Source solution, and we think Greenstone is that solution.

We don't know a great deal about Greenstone, and I was wondering whether I could ask some questions about it so that we are better informed before making our funding application.


1. The big question is how we perform the migration.

Currently in our Digital Library, each page of content (whether part of a Document or Journal Article) consists of:

.png file (page image)
full text of the page (XML) derived from the OCR
metadata (XML) describing the structure of the page

Each document / journal article has structural metadata, plus keywords for browsing / searching, again all XML.

Because of the size of the migration task, we were wondering whether you know of any third parties that undertake this kind of work?
Um, yes, we do. Please email me off list if you're interested in having us help you with this.

To answer the question though, the best approach for migrating your data is probably to write a customized Greenstone plugin for importing the PNG images and XML files that make up your data.
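To make that a little more concrete, here is a rough sketch of the first thing such an importer would need to do: pair each page image with its OCR text file and group the pages by document. Greenstone plugins themselves are written in Perl, so this Python is only an illustration of the pairing logic, not of the plugin API. The file naming scheme (`<docid>_<page>.png` with a matching `.xml`) and the function name `group_pages` are assumptions made for the example; the real Olive export layout may well differ.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed (hypothetical) naming scheme: <docid>_<page>.png and <docid>_<page>.xml
PAGE_NAME = re.compile(r"^(?P<doc>.+)_(?P<page>\d+)$")


def group_pages(export_dir):
    """Return {doc_id: [(page_no, png_path, xml_path), ...]} ordered by page number."""
    docs = defaultdict(list)
    for png in sorted(Path(export_dir).glob("*.png")):
        match = PAGE_NAME.match(png.stem)
        if match is None:
            continue  # file name does not follow the assumed scheme
        xml = png.with_suffix(".xml")
        if not xml.exists():
            continue  # no OCR text for this page image, so skip it
        docs[match["doc"]].append((int(match["page"]), png, xml))
    # Sort numerically so that page 10 comes after page 2
    return {doc: sorted(pages, key=lambda p: p[0]) for doc, pages in docs.items()}
```

Each group of pages could then be handed to whatever builds a single multi-page document for the collection.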


2. Can Greenstone handle searching in both full-text and metadata?


3. Would the full text searching be done on the OCR (or would we have to budget for re-keying the text)?
Yes, Greenstone can certainly search your OCR'ed text. Keep in mind though that the search is only as good as the text you index so the more accurate the text is the better. Having said that, we've built many collections using text from OCR and searching is quite reasonable (and presumably the existing Olive-based system searches the OCR'ed text, right?)


4. As well as making this library available on-line, some users will need to be able to access subsets of it off-line, since they will have little or no Internet access. Can Greenstone export a subset of a library to CD-ROM / DVD-ROM? And would such a portable version have full-text / metadata searching functionality available?
We've done something similar to this for another client, and we effectively just build a version of the collection that includes only the subset of documents that they require on the CD-ROM. When run from CD-ROM/DVD Greenstone looks and operates exactly as it does on the web, including full-text and metadata searching.
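As a rough illustration of the "subset" step, the sketch below copies only the wanted documents into a separate source folder, from which a smaller collection could then be built. It assumes one sub-folder per document named by its ID, which is a hypothetical layout, and `copy_subset` is an illustrative helper rather than anything Greenstone provides.

```python
import shutil
from pathlib import Path


def copy_subset(source_dir, subset_dir, wanted_ids):
    """Copy the per-document folders listed in wanted_ids into subset_dir.

    Returns (copied, missing) so that unknown IDs can be reported.
    """
    source, target = Path(source_dir), Path(subset_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for doc_id in wanted_ids:
        doc_folder = source / doc_id  # assumed layout: one folder per document
        if doc_folder.is_dir():
            shutil.copytree(doc_folder, target / doc_id, dirs_exist_ok=True)
            copied.append(doc_id)
        else:
            missing.append(doc_id)
    return copied, missing
```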


5. Once the migration stage is complete we would add new material, either scanned + OCR'd or in many cases from PDF originals. We create metadata in XML at the time of scanning. Would there be any automated way to add this to the uploaded items, or would it be a matter of a large cut and paste exercise using the Librarian interface?
I'd recommend developing a plugin for importing the metadata directly from your XML files, as mentioned above. To add new content you'd then simply drop the new data in the source directory, along with the old data, then rebuild the collection.
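A minimal sketch of that "drop in and rebuild" routine might look like the following. The staging helper and folder names are hypothetical; `import.pl` and `buildcol.pl` are the standard Greenstone command-line build scripts, but the exact invocation (options, environment setup, and making the new build live) depends on your installation, so the commands are only returned here, not run.

```python
import shutil
from pathlib import Path


def stage_new_items(incoming_dir, import_dir):
    """Copy new files into the collection's source folder, skipping names already there."""
    incoming, target = Path(incoming_dir), Path(import_dir)
    target.mkdir(parents=True, exist_ok=True)
    added = []
    for item in sorted(incoming.iterdir()):
        if item.is_file() and not (target / item.name).exists():
            shutil.copy2(item, target / item.name)
            added.append(item.name)
    return added


def rebuild_commands(collection):
    """Command lines for a rebuild of the named collection (assumed minimal form)."""
    return [["import.pl", collection], ["buildcol.pl", collection]]
```

The returned command lists could be passed to `subprocess.run` one after the other once the Greenstone environment has been set up in the shell.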


6. We can host the future library under either Linux or Windows. Are there any particular advantages to either platform?
Greenstone is a simple CGI application so it'll run fine under most any operating system and web server. My own preference is for linux and apache, as they're nice and stable, and in the past Windows has perhaps not been quite so stable. I don't think there's much between them these days as web server platforms, though you might take some of the following into account when making your decision.
(a) Who will be installing Greenstone, building/rebuilding the collection, etc., and how? We typically build Greenstone collections from the command line, and a linux-based server makes it nice and easy to log in remotely to do that.
(b) What existing IT systems do you have? If you have primarily a Windows-based system the benefits of a linux server may be outweighed by the additional headache of maintaining it.

I hope some of this helps.

Stefan Boddie
DL Consulting Ltd.