|Coping with very large digital collections using Greenstone : Stefan Boddie, John Thompson, David Bainbridge, Ian H. Witten|
The Papers Past collection of the NZ
National Library, which can be viewed at http://paperspast.natlib.govt.nz,
contains approximately 1.1 million pages of historic newspapers. A similar
collection has been built for the National Library Board of Singapore. This
currently contains approximately 600,000 newspaper pages, and will grow to more
than two million. This section describes Papers
Past, but except where noted the structure of the
Text. About 600,000 pages have been processed by optical character recognition (OCR) software to date for the Papers Past collection. These contain 2.1 billion words; 17.25 GB of raw text. The images were digitized from microfilm, but were of poor quality—as were the paper originals—which inevitably resulted in poor performance of the OCR software.
The high incidence of recognition errors yields a much larger number of unique terms than would normally be expected. The 2.1 billion words of running text include 59 million unique terms (2.8%). Our tests show that this will continue to increase linearly as further content is added. By contrast, clean English text typically contains no more than a few hundred thousand terms, and even dramatic increases in size add a relatively small number of new terms. As a point of comparison, the Google collection of n-grams on the English web  is drawn from 500 times as many words (1,025 billion) but contains less than a quarter the number of different words (13.6 million, or 0.0013%). However, words that appear less than 40 times were omitted from this collection, a luxury that we did not have with Papers Past because of the importance of rarely-used place and personal names for information retrieval.
Enormous vocabularies challenge search engine performance . Moreover, the high incidence of errors makes it desirable to offer an approximate search capability. Unfortunately neither MG nor MGPP provide approximate searching. Instead we decided to use the Lucene indexer, because of its fuzzy search feature and proven scalability—it has been tested on collections of more than 100 GB of raw text. Consequently we worked on improving and extending the experimental support for Lucene available through Greenstone.
Metadata. Papers Past involves a massive amount of metadata. Its specification demands that newspaper articles be viewable individually as well as in their original context on the page. The 600,000 OCR’d pages comprise 6.5 million individual articles, each with its own metadata. The physical coordinates that specify an article’s position on the page are stored as metadata to allow article-level images to be clipped from pages.
A further requirement is that search terms be highlighted directly within images. In order to do so the bounding-box coordinates of each and every word in the collection must be stored. These word coordinates represent 49 GB of metadata—nearly three times as large as the collection’s full text. Putting together all the article, page, and issue information yields a total of 52.3 GB of metadata.
The collection’s source files are bi-tonal digital images in TIFF format. From these images, the OCR process generates METS/ALTO XML representation . This includes all the word and article bounding-box coordinates, as well as the full text, article-level metadata and issue-level metadata. The resulting source data include 91,545 METS files, one per newspaper issue, and 601,516 ALTO files, one per newspaper page. Together, these amount to a total of 570 GB of XML, slightly under 1 MB per page. All this XML is imported into Greenstone, the text is indexed with Lucene, and the metadata and bounding-box coordinates are stored by Greenstone in a database.
From the very beginning, Greenstone has used the GNU database management system (GDBM) for storing and retrieving metadata. It is fast and reliable, and does this simple job very well. Crucially—and of particular importance for librarian end-users—GDBM can be installed on Windows, Linux and Macintosh computers without requiring any special configuration. However, the design of Papers Past exposed limitations. GDBM files are restricted to 2 GB; moreover, look-up performance degrades noticeably once the database exceeds 500 MB.
A simple extension was to modify Greenstone to make it automatically spawn new databases as soon as the existing one exceeds 400 MB: currently, Papers Past uses 188 of them. With this multiple database system Greenstone can retrieve word coordinates and other metadata very quickly, even when running on modest hardware.
Images. The archival master files for Papers
Past are compressed bi-tonal TIFF images averaging 770 kB each. The full
collection of 1.1 million images occupies 830 GB. The
A goal of both projects is that no special viewer software is required other than a modern web browser: users should not need browser plug-ins or downloads. However, neither TIFF nor JPEG 2000 are supported natively by all contemporary browsers. Consequently the source images are converted to a web friendly format before being delivered. Also, processing is required to reduce image file size for download, and in the case of individual articles, to clip the images from their surrounding context.
Greenstone normally pre-processes all images when the collection is built, stores the processed versions, and serves them to the user as required. However, it would take nearly two weeks to pre-process the 1.1 million pages of Papers Past, at a conservative estimate of 1 page/sec. To clip out all 6.5 million articles and save them as pre-prepared web images would take a further 15 days, assuming the same rate. The preprocessed images would consume a great deal of additional storage. Furthermore, if in future it became necessary to change the size or resolution of the web images they would all need to be re-processed. Hence it was decided to build an image server to convert archival source images to web accessible versions on demand, and maintain a cache of these for frequently viewed items.
A major strength of Greenstone is its ability to perform well on modest hardware. The core software was designed run on everything from powerful servers to elderly Windows 95/98 and even obsolete Windows 3.1/3.11 machines, which were still prevalent in developing countries. This design philosophy has proven immensely valuable in supporting large collections under heavy load. Greenstone is fast and responsive and, on modern hardware, can service a large number of concurrent users. However, the image server is computation-intensive and consumes significant system resources, though the overall system can cope with moderately heavy loads on a single modern quad-core server, as is deployed at the National Library of Singapore.
Building the collection. It takes significant time to ingest these large collections into Greenstone, even without the need to pre-process the source images. Greenstone’s ingest procedure consists of two phases. The first, called importing, converts all the METS/ALTO data into Greenstone’s own internal, canonical XML format. The second, called building, parses the XML data and creates the Lucene search index and the GDBM metadata databases.
The import phase processes approximately 100 pages/min on a dual quad-core Xeon processor with 4 GB of main memory, taking just over four days to import 600,000 pages. This stage of the ingest procedure may be run in batches and spread over multiple servers if necessary. The building phase processes approximately 300 pages/min on the same kind of processor, taking around 33 hours to build the collection. Currently, the second step cannot be executed incrementally, and the complete collection must be re-built (but not re-imported) whenever data is added.
Future improvements. The problem of large-scale searching is exacerbated by the presence of OCR errors. An obvious solution is to remove errors prior to indexing . However, we decided not to attempt automatic correction at this stage, in order to avoid introducing yet more errors. In our environment this solution is workable so long as Lucene continues to perform adequately on uncorrected text. However, we do plan to investigate the possibility of eliminating the worst of the OCR errors.
A long-standing deficiency of Greenstone is the need to rebuild collections from scratch when documents are added, modified, or deleted. This limitation arose because the original MG indexer was non-incremental—it was optimized for maximum compression, which requires non-incremental operation to ensure that the word statistics on which compression is based reflect the collection as a whole. However, Lucene, which Greenstone now incorporates as an option, is capable of operating incrementally. Today, the only bar to incremental operation concerns the way that the GDBM metadata database is updated and accessed. We are working to allow this to be easily replaced by alternative databases, just as users can select MG, MGPP, or Lucene as indexers when building a collection. As part of this work we will make Greenstone collections updatable incrementally, avoiding the necessity to rebuild the search index and metadata database as the collection evolves.
This will improve scalability of the building process. However, the only way to obtain arbitrary scalability of the run-time system is to distribute it over multiple servers. It is already possible for the image server and the core Greenstone system to run on separate computers; indeed, multiple image servers can easily be used. But for true scalability the search index and metadata databases must also be distributed.