
REPEATED SAMPLING TO IMPROVE
CLASSIFIER ACCURACY?
Jiangying Zhou Daniel Lopresti
Matsushita Information Technology Laboratory
Panasonic Technologies, Inc.
Two Research Way
Princeton, NJ 08540 USA
[jz,dpl]@mitl.research.panasonic.com
ABSTRACT
In statistical pattern recognition, the Bayes Risk serves as a reference point: a limit of excellence that cannot be surpassed. In this paper, we show that by relaxing the assumption that the input be sampled only once, a classification system can be built that beats the Bayes error bound. We present a detailed analysis of the effects of repeated sampling, including proofs that it always yields a net improvement in recognition accuracy for common distributions of interest. Upper and lower bounds on the net improvement are also discussed. We conclude by giving preliminary experimental results that illustrate the applicability of this approach.
INTRODUCTION
A fundamental problem in pattern recognition is to take an unidentified object and associate it with one of a set of predefined classes according to the measurement of some number of its physical attributes. It is well known that the error rate for any statistical classifier based on a specific collection of attributes, or feature set, is lower-bounded by the Bayes Risk [1, 3]. In this paper, we show that by relaxing a basic assumption (that the input be sampled only once) a classification system can be built that beats
* To be presented at the IAPR Workshop on Machine Vision Applications, Kawasaki, Japan, December 1994.
the Bayes error bound. This result is not just a theoretical curiosity, but appears to have practical applications in real-world recognition problems.
According to Bayes decision theory, the design of a statistical classifier is dictated by the characteristics of the a priori class probabilities and by the conditional probability distributions of the measured features for each class. Once the distributions of these random variables are known, the optimal classification boundaries are determined by the Bayes decision rule. Errors arise when the distributions for different classes overlap (e.g., Figure 1). In the traditional case, such mistakes are unavoidable; the classifier is "optimal" in the sense that it minimizes this base error rate.
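The effect of overlapping class-conditional distributions, and how repeated sampling can push the error rate below the single-sample Bayes Risk, can be illustrated with a small simulation. The scenario below is a hypothetical one chosen for illustration (it is not the experimental setup of the paper): two equally likely classes with 1-D Gaussian features N(-1, 1) and N(+1, 1), for which the Bayes rule thresholds at x = 0 and the Bayes Risk is Phi(-1), about 15.9%. Sampling the same object three times with independent noise and taking a majority vote of the three Bayes decisions yields a lower error rate.

```python
import math
import random

def bayes_decide(x):
    """Optimal single-sample Bayes rule for this toy problem: threshold at x = 0."""
    return 1 if x >= 0 else 0

def simulate(trials=200_000, samples=3, seed=42):
    """Compare the single-sample error rate against majority voting over
    `samples` independent measurements of the same object."""
    rng = random.Random(seed)
    single_err = voted_err = 0
    for _ in range(trials):
        label = rng.randrange(2)                  # true class, equal priors
        mean = 1.0 if label == 1 else -1.0
        xs = [rng.gauss(mean, 1.0) for _ in range(samples)]
        # Traditional classifier: one measurement, one Bayes decision.
        if bayes_decide(xs[0]) != label:
            single_err += 1
        # Repeated sampling: majority vote over independent measurements.
        votes = sum(bayes_decide(x) for x in xs)
        if (1 if 2 * votes > samples else 0) != label:
            voted_err += 1
    return single_err / trials, voted_err / trials

single, voted = simulate()
bayes_risk = 0.5 * math.erfc(1.0 / math.sqrt(2.0))  # Phi(-1)
print(f"Bayes risk (theory):  {bayes_risk:.4f}")
print(f"single-sample error:  {single:.4f}")
print(f"3-sample voted error: {voted:.4f}")
```

The single-sample error rate converges to the Bayes Risk, as it must, while the voted error rate falls well below it; the gain comes entirely from the independence of the measurement noise across scans, which is the assumption the paper's analysis rests on.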
In a previous paper, we introduced a methodology that reduces the residual error rate in optical character recognition (OCR) by sampling the input repeatedly and combining the results through a novel voting scheme [6]. We observed that between 20% and 40% of the OCR errors were eliminated when we simply scanned a page three times and applied consensus sequence voting on the output from a particular OCR package. We speculate that when the performance of a recognition process is very high (e.g., 99% or higher), a significant portion of the remaining errors arise from "unlucky" random fluctuations in the input data. In this paper, we present a formal analysis of this effect, showing that better performance (beyond the limit of the Bayes error bound) can be achieved by exploiting