Racing committees for large datasets
Abstract
This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. The method makes it possible to process large datasets even when the underlying base learning algorithm cannot do so efficiently. The basic idea is to split incoming data into chunks and to build a committee from classifiers trained on these individual chunks. Our method extends earlier work by introducing a technique for adaptively pruning the committee. Pruning is essential when applying the algorithm in practice because it dramatically reduces the algorithm’s running time and memory consumption. It also makes it possible to efficiently “race” committees corresponding to different chunk sizes. Racing is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. The results also show that pruning is indeed crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, they demonstrate that pruning can also improve accuracy.
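The chunk-and-prune idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a one-feature decision stump as the base learner, swaps the boosting step for plain majority voting, and uses an assumed pruning rule (a new member is kept only if it strictly improves accuracy on a held-out validation set). All function names (`train_stump`, `vote`, `accuracy`, `build_committee`, `race`) are hypothetical.

```python
from collections import Counter


def train_stump(chunk):
    """Fit a threshold classifier on a chunk of (x, y) pairs, y in {0, 1}."""
    best = None
    for threshold in sorted({x for x, _ in chunk}):
        for polarity in (0, 1):
            # The stump predicts `polarity` when x >= threshold, else the other class.
            errors = sum(
                (polarity if x >= threshold else 1 - polarity) != y
                for x, y in chunk
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, polarity)
    _, threshold, polarity = best
    return lambda x: polarity if x >= threshold else 1 - polarity


def vote(committee, x):
    """Majority vote (assumption: stands in for the boosted combination)."""
    votes = Counter(member(x) for member in committee)
    return votes.most_common(1)[0][0]


def accuracy(committee, data):
    """Fraction of (x, y) pairs in `data` that the committee classifies correctly."""
    return sum(vote(committee, x) == y for x, y in data) / len(data)


def build_committee(stream, chunk_size, validation):
    """Train one stump per chunk; keep it only if validation accuracy improves."""
    committee = []
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        candidate = committee + [train_stump(chunk)]
        # Assumed pruning rule: the first member is always kept, later ones
        # must strictly improve accuracy on the validation set.
        if not committee or accuracy(candidate, validation) > accuracy(committee, validation):
            committee = candidate
    return committee


def race(stream, chunk_sizes, validation):
    """Build one committee per chunk size and return the most accurate one."""
    results = []
    for size in chunk_sizes:
        committee = build_committee(stream, size, validation)
        results.append((accuracy(committee, validation), size, committee))
    best_accuracy, best_size, best_committee = max(results, key=lambda r: r[0])
    return best_size, best_accuracy, best_committee
```

In this sketch the race runs the committees one after another on the same data; a racing scheme as described in the abstract would instead grow them side by side as chunks arrive. The sequential form is used here only to keep the example short.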
Type
Conference Contribution
Date
2002-01-01
Publisher
SPRINGER-VERLAG BERLIN
Rights
This is an author’s accepted version of a conference paper published in the 5th International Conference on Discovery Science. © 2002 ACM.