Coping with very large digital collections using Greenstone: Stefan Boddie, John Thompson, David Bainbridge, Ian H. Witten

4 Summary and Conclusions

In this article we have described how the open source Greenstone digital library software has been used to support two substantial national newspaper projects: one in New Zealand and the other in Singapore. Each currently contains over half a million OCR’d pages and is on track to scale into the multi-million range.

Both collections include full-text indexing. Although OCR software advertises accuracy rates of “99.9%”, this figure is typically quoted at the character level (the error rate at the word level is much higher) and assumes high-quality images. The images in our collections are on average over a century old and were restored from microfilm and microfiche; consequently, the image quality is significantly poorer, which compounds the problem of error rates. Working with OCR text, we have found that the vocabulary size grows essentially linearly with collection size, giving it the potential to become a stress point in the digital library system. However, at the current (and projected) scale of operation the indexing software coped admirably.
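The gap between character-level and word-level accuracy can be illustrated with a little arithmetic. The sketch below is not taken from the projects described here: it assumes that character errors are independent, and the word lengths and the lower accuracy figures are hypothetical values chosen only for illustration.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word is recognized correctly.

    Assumes character errors are independent, which is a simplification:
    real OCR errors tend to cluster on damaged regions of a page.
    """
    return char_accuracy ** word_length


if __name__ == "__main__":
    # Hypothetical accuracy levels: the advertised figure and two lower
    # values standing in for degraded microfilm images.
    for char_acc in (0.999, 0.99, 0.95):
        for length in (5, 10):
            print(
                f"char accuracy {char_acc:.3f}, word length {length:2d}: "
                f"word accuracy {word_accuracy(char_acc, length):.3f}"
            )
```

Under these assumptions a five-character word survives about 99.5% of the time at 99.9% character accuracy, but a ten-character word survives only about 60% of the time once character accuracy falls to 95%.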

To operate at this scale, the changes required to the Greenstone software were modest. First, the existing multi-indexer framework was fine-tuned in its support for Lucene; second, the standard flat-file database (GDBM) was extended to support multiple instances in order to overcome a 2GB per-file limit. Testing showed that this arrangement more than adequately satisfied the needs of these newspaper projects.
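The idea of spreading one logical database over several files can be sketched as follows. This is a minimal illustration, not Greenstone's implementation: the class name `ShardedStore`, the hash-based choice of file, and the use of Python's `dbm.dumb` module as a stand-in for GDBM are all assumptions made for the example. How Greenstone actually assigns records to GDBM instances is not described here.

```python
import dbm.dumb
import os
import tempfile
import zlib


class ShardedStore:
    """Spread keys over several database files so that no single file
    has to hold the whole collection (hypothetical sketch)."""

    def __init__(self, directory: str, num_shards: int = 4) -> None:
        # One database file per shard; dbm.dumb stands in for GDBM.
        self._shards = [
            dbm.dumb.open(os.path.join(directory, f"shard{i}"), "c")
            for i in range(num_shards)
        ]

    def _shard_for(self, key: str):
        # crc32 is stable across runs, unlike Python's built-in hash()
        # for strings, so a key always maps to the same file.
        index = zlib.crc32(key.encode("utf-8")) % len(self._shards)
        return self._shards[index]

    def put(self, key: str, value: str) -> None:
        self._shard_for(key)[key.encode("utf-8")] = value.encode("utf-8")

    def get(self, key: str) -> str:
        return self._shard_for(key)[key.encode("utf-8")].decode("utf-8")

    def close(self) -> None:
        for shard in self._shards:
            shard.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = ShardedStore(tmp, num_shards=3)
        store.put("page-0001", "example OCR text")  # hypothetical record
        print(store.get("page-0001"))
        store.close()
```

Hashing keeps lookups to a single file per key, at the cost of fixing the number of shards up front; adding a shard later would require redistributing the existing records.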

With an eye to future development, we then took things one step further and evaluated Greenstone using DB2, an alternative database solution that in principle is better geared to scaling in a distributed fashion. While the initial performance results were disappointing in comparison to the status quo, the work was exploratory (precisely to find out how well DB2 performed “out of the box”), and further work is scheduled to look at ways to enhance performance for our given application. Notwithstanding the DB2 outcome, as a result of this work Greenstone now supports a multi-database framework that is complementary to its indexer framework, permitting the addition of other database systems such as Oracle and MySQL.
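A multi-database framework of this kind usually amounts to a small common interface with interchangeable backends. The sketch below shows the general pattern only; it is not Greenstone's actual API. The names `MetadataStore`, `InMemoryStore`, `SQLiteStore`, and `open_store` are hypothetical, and SQLite is used purely as a convenient stand-in for a relational backend such as DB2, Oracle, or MySQL.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class MetadataStore(ABC):
    """Common interface that every database backend implements."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore(MetadataStore):
    """Trivial backend, useful for tests."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class SQLiteStore(MetadataStore):
    """Relational backend; SQLite stands in for DB2, Oracle, or MySQL."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata "
            "(item_key TEXT PRIMARY KEY, item_value TEXT)"
        )

    def put(self, key: str, value: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO metadata (item_key, item_value) "
                "VALUES (?, ?)",
                (key, value),
            )

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT item_value FROM metadata WHERE item_key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


# Registry mapping a configured backend name to its implementation.
BACKENDS: Dict[str, Type[MetadataStore]] = {
    "memory": InMemoryStore,
    "sqlite": SQLiteStore,
}


def open_store(backend: str) -> MetadataStore:
    return BACKENDS[backend]()


if __name__ == "__main__":
    for name in BACKENDS:
        store = open_store(name)
        store.put("page-0001", "example metadata")  # hypothetical record
        print(name, store.get("page-0001"))
```

Because calling code depends only on the shared interface, supporting another database system means adding one class and one registry entry.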