|Coping with very large digital collections using Greenstone : Stefan Boddie, John Thompson, David Bainbridge, Ian H. Witten|
In this article we have described how the open source Greenstone digital library software has been used to support two substantial national newspaper projects: one in New Zealand and the other in Singapore. Each currently contains over half a million OCR’d pages and is on track to scale into the multi-million range.
Full-text indexing is included in both. OCR software advertises accuracy rates of “99.9%”, but this figure is typically quoted at the character level (the error rate at the word level is much higher) and assumes high-quality images. The images in our collections are on average over a century old and were restored from microfilm and microfiche; consequently the image quality is significantly poorer, which compounds the error rate. Working with OCR text, we have found that the vocabulary size grows essentially linearly with collection size, giving it the potential to become a stress point in the digital library system. However, for the current (and projected) scale of operation the indexing software coped admirably.
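A rough illustration of why a character-level accuracy figure overstates word-level accuracy: if each character is assumed to be recognized independently with the same probability (a simplifying assumption, not a measured property of these collections), a word is correct only when all of its characters are. The word length of 5 and the degraded 95% figure below are illustrative values, not results from the projects.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word is recognized correctly,
    assuming independent, identically distributed character errors."""
    return char_accuracy ** word_length


def expected_bad_words(char_accuracy: float, word_length: int, words: int) -> float:
    """Expected number of misrecognized words among `words` words."""
    return (1.0 - word_accuracy(char_accuracy, word_length)) * words


if __name__ == "__main__":
    # Advertised character accuracy versus a hypothetical degraded scan.
    for char_acc in (0.999, 0.95):
        acc = word_accuracy(char_acc, word_length=5)
        bad = expected_bad_words(char_acc, word_length=5, words=1000)
        print(f"char accuracy {char_acc:.3f}: word accuracy {acc:.4f}, "
              f"about {bad:.0f} bad words per 1000")
```

Under this model, each misrecognized word is likely to be a previously unseen string, which is consistent with a vocabulary that keeps growing as pages are added.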
To operate at this scale, the Greenstone software developments required were modest. First, the existing multi-indexer framework was fine-tuned in its support for Lucene; second, the standard flat-file database (GDBM) was extended to support multiple instances in order to surpass a 2 GB per-file limit. Testing showed that this arrangement more than adequately satisfied the needs of these newspaper projects.
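The idea of spreading one logical database over several files can be sketched as follows. This is a minimal illustration only, not Greenstone's implementation: the class name `ShardedStore`, the hash-based routing, and the shard count are all hypothetical, and Python's portable `dbm.dumb` module stands in for GDBM.

```python
import dbm.dumb
import os
import tempfile
import zlib


class ShardedStore:
    """Key-value store split across several database files (hypothetical design)."""

    def __init__(self, directory: str, num_shards: int = 4):
        os.makedirs(directory, exist_ok=True)
        # One database file set per shard; each stays well below any per-file limit.
        self._shards = [
            dbm.dumb.open(os.path.join(directory, f"shard{i}"), "c")
            for i in range(num_shards)
        ]

    def _shard_for(self, key: str):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return self._shards[zlib.crc32(key.encode("utf-8")) % len(self._shards)]

    def put(self, key: str, value: str) -> None:
        self._shard_for(key)[key.encode("utf-8")] = value.encode("utf-8")

    def get(self, key: str) -> str:
        return self._shard_for(key)[key.encode("utf-8")].decode("utf-8")

    def close(self) -> None:
        for shard in self._shards:
            shard.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = ShardedStore(tmp, num_shards=3)
        store.put("page-0001", "OCR text of the first page")
        print(store.get("page-0001"))
        store.close()
```

Because the routing depends on the number of shards, that number has to stay fixed for the lifetime of a collection in this sketch; a scheme that fills one file and then opens the next would avoid this at the cost of a lookup that may probe several files.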
With an eye to future development, we then took things one step further and analyzed Greenstone using DB2, an alternative database solution that in principle is better geared to scale in a distributed fashion. While initial performance results were disappointing in comparison to the status quo, the work was exploratory (precisely to find out how well it performed “out of the box”), and further work is scheduled to look at ways to enhance performance for our application. Notwithstanding the DB2 outcome, as a result of this work Greenstone now supports a multi-database framework, complementary to its indexer framework, that permits the addition of other database systems such as Oracle and MySQL.
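A multi-database framework of this kind can be pictured as a small interface that each database system implements, plus a registry that selects one by name. The sketch below is an assumption-laden illustration, not Greenstone's code: `MetadataBackend`, `open_backend`, and the registry are hypothetical names, and SQLite is used as the relational stand-in only because it ships with Python.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class MetadataBackend(ABC):
    """Hypothetical interface that every pluggable database must implement."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...


class InMemoryBackend(MetadataBackend):
    """Stand-in for a flat-file store."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class SQLiteBackend(MetadataBackend):
    """Stand-in for a relational database system."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata (key TEXT PRIMARY KEY, value TEXT)"
        )

    def put(self, key: str, value: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO metadata (key, value) VALUES (?, ?)",
                (key, value),
            )

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM metadata WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


# Supporting a further database system means adding one entry here.
BACKENDS = {"memory": InMemoryBackend, "sqlite": SQLiteBackend}


def open_backend(name: str) -> MetadataBackend:
    return BACKENDS[name]()


if __name__ == "__main__":
    for name in BACKENDS:
        backend = open_backend(name)
        backend.put("doc1.Title", "Evening Post")
        print(name, backend.get("doc1.Title"))
```

The calling code never names a concrete database, so performance comparisons between backends can be run against the same workload without changing anything else.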