An Evaluation of Information Retrieval Accuracy with Simulated OCR Output

W. B. Croft and S. Harding

Computer Science Department

University of Massachusetts, Amherst

K. Taghva and J. Borsack

Information Science Research Institute
University of Nevada, Las Vegas

Abstract

Optical Character Recognition (OCR) is a critical part of many text-based applications. Although some commercial systems use the output from OCR devices to index documents without editing, there is very little quantitative data on the impact of OCR errors on the accuracy of a text retrieval system. Because of the difficulty of constructing test collections to obtain this data, we have carried out evaluations using simulated OCR output on a variety of databases. The results show that the errors introduced by high-quality OCR devices have little effect on the accuracy of retrieval, but low-quality devices used with databases of short documents can result in significant degradation.

1 Introduction

Text-based information systems have become increasingly important in business, government, and academia. In many applications, the source of the text is not documents from word processors, but instead documents in their original paper form. Although imaging systems provide a simple means of storing these documents and retrieving them through manually assigned keywords, full-text access will in general be much more effective. OCR is a crucial step in getting from paper documents to full-text retrieval.

For printed documents, OCR techniques can recognize words with a high level of accuracy. To produce output that is suitable for display, a significant amount of human editing is needed. For automatic indexing and retrieval, however, the OCR word accuracy may be sufficient. Some text retrieval systems have taken this approach, combining OCR for indexing and imaging for display.

From an information retrieval point of view, the main issue is the impact of OCR