Racing committees for large datasets

E.T. Frank, G. Holmes, R.B. Kirkby and M.A. Hall

2002

Working Paper No. 03/02

Abstract

This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It allows large datasets to be processed even if the underlying base learning algorithm cannot do so efficiently. The basic idea is to split incoming data into chunks and build a committee from classifiers built on these individual chunks [3]. Our method extends earlier work in two ways: (a) the best chunk size is chosen automatically by racing committees corresponding to different chunk sizes, and (b) the committees are pruned adaptively to keep the size of each individual committee as small as possible without negatively affecting accuracy. This paper shows that choosing an appropriate chunk size automatically is important because the accuracy of the resulting committee can vary significantly with the chunk size. It also shows that pruning is crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results demonstrate that pruning can also improve accuracy.
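
The chunk-and-race idea summarized in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes one-feature decision stumps as base classifiers, combines them by plain majority vote rather than boosting, and uses "keep a new member only if validation accuracy strictly improves" as a stand-in pruning rule. All names (train_stump, PrunedCommittee, race) and the synthetic data are hypothetical.

```python
import random
from collections import Counter


def train_stump(chunk):
    """Fit a threshold classifier on (x, y) pairs with y in {0, 1}."""
    best = None
    for threshold in sorted({x for x, _ in chunk}):
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in chunk
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label


class PrunedCommittee:
    """Committee built from fixed-size chunks (assumed voting and pruning rules)."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.buffer = []   # examples waiting to fill the next chunk
        self.members = []  # classifiers kept so far

    def predict(self, x):
        if not self.members:
            return 0  # arbitrary default before any member exists
        return Counter(m(x) for m in self.members).most_common(1)[0][0]

    def accuracy(self, data):
        return sum(self.predict(x) == y for x, y in data) / len(data)

    def observe(self, example, validation):
        self.buffer.append(example)
        if len(self.buffer) < self.chunk_size:
            return
        candidate = train_stump(self.buffer)
        self.buffer = []
        before = self.accuracy(validation)
        self.members.append(candidate)
        # Assumed pruning rule: drop the candidate unless it helps.
        if self.accuracy(validation) <= before:
            self.members.pop()


def race(stream, validation, chunk_sizes):
    """Feed every example to one committee per chunk size; return the best."""
    committees = [PrunedCommittee(size) for size in chunk_sizes]
    for example in stream:
        for committee in committees:
            committee.observe(example, validation)
    return max(committees, key=lambda c: c.accuracy(validation))


# Synthetic, noise-free data: the label is 1 exactly when x > 0.5.
rng = random.Random(0)


def sample(n):
    return [(x, int(x > 0.5)) for x in (rng.random() for _ in range(n))]


validation = sample(200)
stream = sample(600)
chunk_sizes = [10, 50, 200]
winner = race(stream, validation, chunk_sizes)
print(winner.chunk_size, len(winner.members), winner.accuracy(validation))
```

Running all committees over the same stream in a single pass is what makes the comparison a race: no chunk size has to be fixed in advance, and the validation-based acceptance test keeps each committee from growing with every chunk.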


Working Papers Series, ISSN: 1170-487X

Contact: working-papers@cs.waikato.ac.nz

Department of Computer Science, University of Waikato, Hamilton, New Zealand.
