Frank, E. & Witten, I.H. (1999). Reduced-error pruning with significance tests. (Working paper 99/10). Hamilton, New Zealand: University of Waikato, Department of Computer Science.
Permanent Research Commons link: http://hdl.handle.net/10289/1039
When building classification models, it is common practice to prune them to counter spurious effects of the training data: this often improves performance and reduces model size. “Reduced-error pruning” is a fast pruning procedure for decision trees that is known to produce small and accurate trees. Apart from the data from which the tree is grown, it uses an independent “pruning” set, and pruning decisions are based on the model’s error rate on this fresh data. Recently it has been observed that reduced-error pruning overfits the pruning data, producing unnecessarily large decision trees. This paper investigates whether standard statistical significance tests can be used to counter this phenomenon. The problem of overfitting to the pruning set highlights the need for significance testing. We investigate two classes of test, “parametric” and “non-parametric.” The standard chi-squared statistic can be used both in a parametric test and as the basis for a non-parametric permutation test. In both cases it is necessary to select the significance level at which pruning is applied. We show empirically that both versions of the chi-squared test perform equally well if their significance levels are adjusted appropriately. Using a collection of standard datasets, we show that significance testing improves on standard reduced-error pruning if the significance level is tailored to the particular dataset at hand using cross-validation, yielding consistently smaller trees that perform at least as well and sometimes better.
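As an illustration of the two tests named in the abstract, the following Python sketch computes the chi-squared statistic for a contingency table of subtree branches against class labels on the pruning set, then derives a p-value both parametrically and by permutation. It is not the authors' implementation: the function names, the alpha threshold, the number of permutations, and the rule of keeping a subtree only when the association is significant are all assumptions made for this example.

```python
import math
import random


def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    terms = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


def parametric_p(table):
    """Parametric test: compare the statistic with the chi-squared distribution.

    Assumes every row and column of the table has a non-zero total.
    """
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2_sf(chi_squared(table), df)


def build_table(branches, labels):
    """Cross-tabulate branch membership against class label."""
    branch_ids = sorted(set(branches))
    class_ids = sorted(set(labels))
    table = [[0] * len(class_ids) for _ in branch_ids]
    for b, c in zip(branches, labels):
        table[branch_ids.index(b)][class_ids.index(c)] += 1
    return table


def permutation_p(branches, labels, n_perm=2000, seed=0):
    """Permutation test: shuffle labels and count statistics at least as large."""
    rng = random.Random(seed)
    observed = chi_squared(build_table(branches, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if chi_squared(build_table(branches, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def keep_subtree(p_value, alpha=0.05):
    """Hypothetical rule: keep the subtree only if the association is significant."""
    return p_value <= alpha
```

Under these assumptions, a subtree whose branches separate the pruning instances cleanly yields a small p-value and is kept, while one whose branches show no association with the class is pruned. The value of alpha is the significance level that the abstract says must be chosen per dataset, for instance by cross-validation.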