Ensemble learning of high dimension datasets

Lim, Nick Jin Sean

Authors

Lim, Nick Jin Sean

Files

thesis.pdf (27.15 MB)

Permanent Link

https://hdl.handle.net/10289/13422

Abstract

Ensemble learning, an approach in machine learning, solves complex tasks with minimal human intervention by basing its decisions on the collective decision of a committee of learners. Advances in computing technology have enabled researchers to build datasets with features numbering in the thousands, and to build more accurate predictive models. Unfortunately, high-dimensional datasets are especially challenging for machine learning because of the phenomenon dubbed the "curse of dimensionality". One approach to overcoming this challenge is ensemble learning using the Random Subspace (RS) method, which has been shown to perform very well empirically, although there are few theoretical explanations of its effectiveness for classification tasks. In this thesis, we aim to provide theoretical insights into RS ensemble classifiers, and thereby a more in-depth understanding of the theoretical foundations of other ensemble classifiers. We investigate the conditions for norm preservation in RS projections. These insights provide a theoretical basis for using RS in algorithms that rely on the geometry of the data (e.g. clustering, nearest-neighbour). We then investigate guarantees for the dot product of two random vectors after RS projection. Such guarantees are useful for capturing the geometric structure of a classification problem. Next, we investigate the accuracy of a majority-vote ensemble using a generalized Polya urn model, and how the parameters of the model are derived from diversity measures. We discuss the practical implications of the model, explore the noise tolerance of ensembles, and give a plausible explanation for the effectiveness of ensembles. We provide empirical corroboration for our main results with both synthetic and real-world high-dimensional data. We also discuss the implications of our theory for other applications (e.g. compressive sensing).
Based on our results, we propose a method of building ensembles for Deep Neural Network image classification using RS projections without needing to retrain the neural network, which showed improved accuracy and very good robustness to adversarial examples. Ultimately, we hope that the insights gained in this thesis will make inroads towards answering a key open question for ensemble classifiers: "When will an ensemble of weak learners outperform a single carefully tuned learner?"

Citation

Lim, J. S. (2020). Ensemble learning of high dimension datasets (Thesis, Doctor of Philosophy (PhD)). The University of Waikato, Hamilton, New Zealand. Retrieved from https://hdl.handle.net/10289/13422

Type

Thesis

Date

2020

Publisher

The University of Waikato

Degree

Doctor of Philosophy (PhD)

Supervisor

Durrant, Robert J
Hunt, Lynette Anne
