Reduced-error pruning with significance tests

Frank, Eibe; Witten, Ian H.

Authors

Frank, Eibe

Witten, Ian H.

Files

uow-cs-wp-1999-10.pdf (2.12 MB)

Permanent Link

https://hdl.handle.net/10289/1039

Abstract

When building classification models, it is common practice to prune them to counter spurious effects of the training data: this often improves performance and reduces model size. “Reduced-error pruning” is a fast pruning procedure for decision trees that is known to produce small and accurate trees. Apart from the data from which the tree is grown, it uses an independent “pruning” set, and pruning decisions are based on the model’s error rate on this fresh data. Recently it has been observed that reduced-error pruning overfits the pruning data, producing unnecessarily large decision trees. This paper investigates whether standard statistical significance tests can be used to counter this phenomenon. The problem of overfitting to the pruning set highlights the need for significance testing. We investigate two classes of test, “parametric” and “non-parametric.” The standard chi-squared statistic can be used both in a parametric test and as the basis for a non-parametric permutation test. In both cases it is necessary to select the significance level at which pruning is applied. We show empirically that both versions of the chi-squared test perform equally well if their significance levels are adjusted appropriately. Using a collection of standard datasets, we show that significance testing improves on standard reduced-error pruning if the significance level is tailored to the particular dataset at hand using cross-validation, yielding consistently smaller trees that perform at least as well and sometimes better.
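The permutation variant of the chi-squared test mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (chi_squared, permutation_p_value, keep_subtree), the default significance level of 0.05, the number of permutations, and the choice to describe a node by the branch each instance follows are all assumptions made for illustration. The sketch computes the chi-squared statistic for the branch-by-class contingency table at a node, then estimates a p-value by shuffling the class labels.

```python
import random
from collections import Counter


def chi_squared(branches, labels):
    """Pearson chi-squared statistic for the branch-by-class table."""
    n = len(labels)
    branch_totals = Counter(branches)
    class_totals = Counter(labels)
    observed = Counter(zip(branches, labels))
    stat = 0.0
    for b, nb in branch_totals.items():
        for c, nc in class_totals.items():
            expected = nb * nc / n  # count expected under independence
            stat += (observed[(b, c)] - expected) ** 2 / expected
    return stat


def permutation_p_value(branches, labels, n_permutations=1000, seed=0):
    """Estimate the p-value by shuffling class labels (hypothetical helper)."""
    rng = random.Random(seed)
    observed_stat = chi_squared(branches, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Small tolerance so that ties are counted despite rounding error.
        if chi_squared(branches, shuffled) >= observed_stat - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def keep_subtree(branches, labels, alpha=0.05):
    """Keep the split only if the association is significant at level alpha.

    alpha = 0.05 is an assumed default; the abstract says the level
    should be tuned per dataset using cross-validation.
    """
    return permutation_p_value(branches, labels) <= alpha
```

For example, keep_subtree(["L"] * 10 + ["R"] * 10, ["a"] * 10 + ["b"] * 10) would keep the split, because the branch taken predicts the class perfectly. The parametric version of the test would instead compare the same statistic against the chi-squared distribution with the appropriate degrees of freedom.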

Citation

Frank, E. & Witten, I.H. (1999). Reduced-error pruning with significance tests. (Working paper 99/10). Hamilton, New Zealand: University of Waikato, Department of Computer Science.

Type

Working Paper

Series name

Computer Science Working Papers

Date

1999-06

Publisher

Computer Science, University of Waikato
