Coping with very large digital collections using Greenstone: Stefan Boddie, John Thompson, David Bainbridge, Ian H. Witten

3 Distributed operation with IBM’s DB2

The research version of Greenstone [8] includes a sub-collection facility that allows users to search in a seamless way across sub-collections running on different servers. However, this facility is limited and does not fully support distributed operation; in particular it does not allow distributed metadata databases to be accessed in a simple and uniform way. Buried deep in the code are structures that could be extended to improve support for this. However, we are taking a different approach.

We are evaluating the performance of IBM’s DB2 platform as both indexer and metadata database for Greenstone. When designing their collections, users will have the option of using this third-party software instead of the default document index and metadata database. DB2 has known scalability and distributed functionality: built into it is the ability to distribute a single database over multiple servers.

The DB2 database. DB2 is a mature, enterprise-scale database application developed by IBM [9,10]. It has a client/server architecture, and a single instance of the DB2 server can scale to support a 3.5 terabyte database with little loss in performance [11]. More importantly, multiple DB2 servers can be federated, with a single front-end providing transparent access to a farm of databases [12]. By augmenting Greenstone with an underlying DB2 database we have been able to create distributed indexes using this built-in federation technology.
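The idea of a single front-end giving transparent access to a farm of databases can be illustrated with a small sketch. This is not DB2's federation mechanism: it assumes in-memory SQLite databases as stand-ins for the remote servers, and the `FederatedFrontEnd` class, the `documents` table and its columns are hypothetical names chosen for illustration.

```python
import sqlite3


class FederatedFrontEnd:
    """Toy front-end that hides several back-end databases behind one call."""

    def __init__(self, backends):
        # Each backend is a sqlite3 connection standing in for a remote server.
        self.backends = backends

    def search(self, term):
        # Run the same query on every back-end and merge the results,
        # so the caller never needs to know how the data is partitioned.
        hits = []
        for conn in self.backends:
            rows = conn.execute(
                "SELECT doc_id, title FROM documents WHERE title LIKE ?",
                (f"%{term}%",),
            ).fetchall()
            hits.extend(rows)
        return sorted(hits)


def make_backend(rows):
    """Create one in-memory database holding part of the collection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?, ?)", rows)
    return conn


front_end = FederatedFrontEnd([
    make_backend([("a1", "Harbour news"), ("a2", "Shipping notices")]),
    make_backend([("b1", "Harbour board meeting")]),
])
results = front_end.search("Harbour")
```

Here `results` contains the matching rows from both back-ends. In a real federated DB2 installation the fan-out would happen inside the database server rather than in client code.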

We needed to justify the use of a commercial module in an open source system intended for wide distribution. An important factor is the availability of a free version of this software. The DB2 Express-C database has the same basic functionality as the commercial version, but with a restriction on the number of servers [13]. It also allows several key text search components to be integrated, including the Net Search Extender module, which allows full text, wildcard and fuzzy searching [14]. We cannot redistribute this software because it is not issued under the GPL, but we can make it easy for users to download it themselves and hook it into Greenstone.

Implementation. Greenstone is written in a modular fashion. As noted earlier, it is already possible to switch between different full-text indexers. We needed to extend this modularity to encompass the metadata database too. To do this, we modified both the index-building and server modules. The build-time code, written in Perl, is responsible for extracting text and metadata from source documents and building indexes from them. The server code, written in C++, retrieves information from the index (possibly based upon some search criteria) and presents it to the user as HTML web pages. Both stages were integrated with DB2 through abstraction layers for the two programming languages: Class::DBI [15] for Perl and the DBConnect API [16] for C++. Finally, a sample collection was built and tested for functionality, using a DB2 installation that was itself a front-end to a federated database configuration, and the problems that arose were corrected as testing proceeded.
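The kind of modularity described here, in which build-time and run-time code talk to the metadata database only through an abstraction layer, can be sketched as follows. This is an illustrative sketch in Python rather than Greenstone's Perl or C++ code. The `MetadataStore` interface, both back-ends and the `build` and `render_title` functions are hypothetical, and SQLite stands in for a SQL database such as DB2.

```python
import sqlite3
from abc import ABC, abstractmethod
from html import escape


class MetadataStore(ABC):
    """Interface that both the builder and the server depend on."""

    @abstractmethod
    def put(self, doc_id, field, value): ...

    @abstractmethod
    def get(self, doc_id, field): ...


class InMemoryStore(MetadataStore):
    """Stand-in for a default, built-in metadata database."""

    def __init__(self):
        self._data = {}

    def put(self, doc_id, field, value):
        self._data[(doc_id, field)] = value

    def get(self, doc_id, field):
        return self._data.get((doc_id, field))


class SqlStore(MetadataStore):
    """Stand-in for a third-party SQL back-end."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata "
            "(doc_id TEXT, field TEXT, value TEXT, PRIMARY KEY (doc_id, field))"
        )

    def put(self, doc_id, field, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
            (doc_id, field, value),
        )

    def get(self, doc_id, field):
        row = self._conn.execute(
            "SELECT value FROM metadata WHERE doc_id = ? AND field = ?",
            (doc_id, field),
        ).fetchone()
        return row[0] if row else None


def build(store, documents):
    """Build-time stage: write extracted metadata through the interface."""
    for doc_id, meta in documents.items():
        for field, value in meta.items():
            store.put(doc_id, field, value)


def render_title(store, doc_id):
    """Run-time stage: read metadata back and present it as HTML."""
    title = store.get(doc_id, "Title") or "Untitled"
    return f"<h1>{escape(title)}</h1>"


docs = {"doc1": {"Title": "Papers Past", "Date": "1839"}}
pages = []
for store in (InMemoryStore(), SqlStore(sqlite3.connect(":memory:"))):
    build(store, docs)
    pages.append(render_title(store, "doc1"))
```

Because `build` and `render_title` see only the interface, swapping the back-end requires no change to either stage, which is the property the modifications aimed for.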

This configuration demonstrated proof-of-concept of a distributed version of Greenstone using the DB2 distributed database. However, further work is required to make it run smoothly in a production environment.

 

Evaluation. The new module was benchmarked against the MG, MGPP and Lucene indexers on large test collections. Three kinds of documents were used: regular electronic documents from the well-known Reuters Corpus, machine-generated Lorem Ipsum, and newspaper text. The first two were short documents of about 200–250 words, and exhibited a rapidly decreasing incidence of new terms typical of standard English. The third type of document contained about 5 pages of newspaper text, or 15,000–20,000 words. Because this text was produced by an imperfect OCR process, the number of new terms continues to grow rapidly no matter how many documents are considered.

Table 1 shows information about building indexes with the four tools. The first column gives the number of documents (the newspaper documents contained an average of 5 pages), and the second the amount of indexable text. The third shows the average number of terms per document that were new, i.e. did not appear in the text corresponding to the previous row of the table. As noted above, an extraordinary rate of growth is apparent for newspaper text.
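One way to compute a "new terms per document" figure of this kind is sketched below. The exact procedure used for Table 1 is not given, so this makes two assumptions: each batch holds only the documents added since the previous row, and the count of new terms is divided by the number of documents in that batch. The tokenizer is also a simplification.

```python
import re


def tokenize(text):
    """Split text into lower-case alphanumeric terms (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def new_terms_per_document(batches):
    """Return, for each batch, the average number of previously unseen terms.

    `batches` is a list of non-empty lists of document strings; each batch
    is assumed to hold the documents added since the previous table row.
    """
    seen = set()
    averages = []
    for batch in batches:
        batch_terms = set()
        for doc in batch:
            batch_terms.update(tokenize(doc))
        fresh = batch_terms - seen  # terms that no earlier batch contained
        averages.append(len(fresh) / len(batch))
        seen |= batch_terms
    return averages


batches = [
    ["the cat sat", "the dog sat"],  # 4 new terms over 2 documents
    ["the cat sat", "the dog ran"],  # only "ran" is new
]
rates = new_terms_per_document(batches)
```

For clean text the rate falls quickly, as in this toy example. OCR errors keep introducing spurious terms, which would keep the rate high.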

The next four columns show the time taken by the four indexers to build full-text indexes. Unfortunately the DB2 code was not yet stable enough to reliably handle the largest of these collections, which accounts for the missing figures. The three existing indexers typically outperform DB2. However, the final columns show processor load (we used a dual-core machine, which is why some percentages exceed 100%), and indicate that DB2 under-utilizes the processor—probably due to some database or file IO bottleneck. We find these results encouraging: DB2 is generally comparable with the other indexers despite this evident problem.

                                           Build time (hh:mm:ss)               Processor load (%)
Documents  Indexable text (MB)  New terms  MG       MGPP     Lucene   DB2      MG    MGPP  Lucene  DB2
Reuters
1,000      0.4                  13.6       00:10    00:13    00:10    00:43    73    88    109     13
10,000     3.7                  5.7        01:10    01:29    00:51    02:15    99    113   130     32
20,000     7.2                  4.5        02:17    02:51    01:35    03:53    102   115   130     48
Lorem
50,000     84.2                 2.6        06:02    15:29    06:59    08:58    112   110   103     55
100,000    154.8                2.5        12:14    32:42    19:02    47:01    111   109   77      22
500,000    773.4                2.2        2:02:30  3:51:56  1:38:38  –        60    86    78      –
1,000,000  1510.0               2.0        4:03:59  7:02:58  2:22:59  –        26    78    67      –
Newspapers
500        16.0                 1132       02:38    07:55    01:44    04:46    98    103   113     77
1,000      31.5                 896        05:19    15:10    03:25    09:40    99    104   114     77
1,500      44.6                 708        07:04    20:24    04:38    13:21    99    105   118     76
5,000      247.0                1369       38:10    2:21:43  28:27    1:41:09  95    98    100     53
15,000     745.0                1110       1:54:54  7:36:57  1:27:57  5:07:29  95    94    99      53

Table 1. Using DB2 to build collections of various sizes
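To compare indexers across collection sizes, the build times in Table 1 can be converted to throughput. The sketch below assumes that two-field entries such as 47:01 are minutes and seconds, and three-field entries such as 2:02:30 are hours, minutes and seconds. The function names are illustrative, and the sample values are copied from the Lorem Ipsum row for 100,000 documents.

```python
def parse_duration(text):
    """Convert 'mm:ss' or 'h:mm:ss' to a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def throughput_mb_per_min(size_mb, build_time):
    """Indexable text processed per minute of build time."""
    return size_mb / parse_duration(build_time) * 60


# Lorem Ipsum, 100,000 documents, 154.8 MB of indexable text (from Table 1).
size_mb = 154.8
build_times = {"MG": "12:14", "MGPP": "32:42", "Lucene": "19:02", "DB2": "47:01"}

for name, build_time in build_times.items():
    rate = throughput_mb_per_min(size_mb, build_time)
    print(f"{name:7s}{rate:6.2f} MB/min")
```

Expressing the figures this way makes it easier to see how each indexer's rate changes as the collection grows, independently of the absolute collection size.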

Informal experiments suggest that the run-time performance of DB2 remains unchanged as the size of the underlying collection grows. However, the DB2 version generally took longer to generate a page of results than its three peers, suggesting that optimization is once again required.