An Evaluation of Information Retrieval
Accuracy with Simulated OCR Output
W. B. Croft and S. Harding
Computer Science Department
University of Massachusetts, Amherst
K. Taghva and J. Borsack
Information Science Research Institute
University of Nevada, Las Vegas
Abstract
Optical Character Recognition (OCR) is a critical part of many text-based applications. Although some commercial systems use the output from OCR devices to index documents without editing, there is very little quantitative data on the impact of OCR errors on the accuracy of a text retrieval system. Because of the difficulty of constructing test collections to obtain this data, we have carried out evaluations using simulated OCR output on a variety of databases. The results show that the errors from high-quality OCR devices have little effect on the accuracy of retrieval, but low-quality devices used with databases of short documents can result in significant degradation.
1 Introduction
Text-based information systems have become increasingly important in business, government, and academia. In many applications, the source of the text is not documents from word processors, but instead documents in their original paper form. Although imaging systems provide a simple means of storing these documents and retrieving them through manually assigned keywords, full-text access will in general be much more effective. OCR will be a crucial part of getting from paper documents to full-text retrieval.
For printed documents, OCR techniques can recognize words with a high level of accuracy. To produce output that is suitable for display, a significant amount of human editing is needed. For automatic indexing and retrieval, however, the OCR word accuracy may be sufficient. Some text retrieval systems have taken this approach, combining OCR for indexing and imaging for display.
From an information retrieval point of view, the main issue is the impact of OCR