Racing committees for large datasets
E.T. Frank, G. Holmes, R.B. Kirkby and M.A. Hall
Working Paper No. 03/02
This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It allows the processing of large datasets even if the underlying base learning algorithm cannot efficiently do so. The basic idea is to split incoming data into chunks and build a committee from classifiers built on these individual chunks. Our method extends earlier work in two ways: (a) the best chunk size is chosen automatically by racing committees corresponding to different chunk sizes, and (b) the committees are pruned adaptively to keep the size of each individual committee as small as possible without negatively affecting accuracy. This paper shows that choosing an appropriate chunk size automatically is important because the accuracy of the resulting committee can vary significantly with the chunk size. It also shows that pruning is crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results demonstrate that pruning can also improve accuracy.
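The chunk-and-race idea can be illustrated with a minimal Python sketch. It is not the paper's implementation. Several simplifications are assumed: one-feature decision stumps stand in for the base learner, an unweighted majority vote replaces the boosting algorithm, the committees are built one after another rather than raced concurrently on the incoming stream, and the pruning rule (keep a new member only if validation accuracy strictly improves) is one plausible choice. All function names and the toy data are hypothetical.

```python
from collections import Counter


def train_stump(chunk):
    """Fit a one-feature threshold classifier on a chunk of (x, y) pairs, y in {0, 1}."""
    best = None
    for threshold in sorted({x for x, _ in chunk}):
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in chunk
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label


def committee_predict(committee, x):
    """Unweighted majority vote (a stand-in for the boosted committee)."""
    votes = Counter(member(x) for member in committee)
    return votes.most_common(1)[0][0]


def accuracy(committee, data):
    """Fraction of (x, y) pairs in data that the committee classifies correctly."""
    if not committee:
        return 0.0
    return sum(committee_predict(committee, x) == y for x, y in data) / len(data)


def build_committee(stream, chunk_size, validation):
    """Train one stump per full chunk; keep it only if validation accuracy improves."""
    committee = []
    for start in range(0, len(stream) - chunk_size + 1, chunk_size):
        chunk = stream[start:start + chunk_size]
        candidate = committee + [train_stump(chunk)]
        # Assumed pruning rule: reject members that do not strictly help.
        if not committee or accuracy(candidate, validation) > accuracy(committee, validation):
            committee = candidate
    return committee


def race(stream, chunk_sizes, validation):
    """Build one committee per chunk size and return the most accurate one."""
    committees = {size: build_committee(stream, size, validation) for size in chunk_sizes}
    best_size = max(chunk_sizes, key=lambda size: accuracy(committees[size], validation))
    return best_size, committees[best_size]


# Toy data (hypothetical): label is 1 when x >= 50; order is a fixed permutation.
xs = [(i * 37) % 100 for i in range(100)]
data = [(x, int(x >= 50)) for x in xs]
stream, validation = data[:80], data[80:]

chunk_sizes = (5, 10, 20)
best_size, committee = race(stream, chunk_sizes, validation)
```

In this sketch the validation set both prunes the committees and decides the race, which keeps the example short; a separate held-out set would give a less optimistic comparison.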