Improving ensembles and prediction intervals for machine learning on data streams

Rights

All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.

Abstract

The rapid growth of streaming data presents significant challenges for traditional machine learning, including common tasks such as regression and classification. This thesis proposes adaptive and dynamic methods to address key issues in evolving data streams, including concept drift, uncertainty quantification, and ensemble optimization. The Self-Optimising K Nearest Leaves (SOKNL) regression algorithm integrates k-Nearest Neighbors (kNN) and Adaptive Random Forest Regression (ARF-Reg), dynamically optimizing neighbor selection to improve regression accuracy without relying on fixed window sizes. Extensive experiments suggest that SOKNL outperforms state-of-the-art streaming regression algorithms, including ARF-Reg, the algorithm it builds on. For classification tasks, the Dynamic Ensemble Member Selection (DEMS) method dynamically adjusts ensemble size and selects members based on accuracy and diversity, improving predictive performance while handling concept drift. DEMS extends the idea of dynamic selection of ensemble members from SOKNL to classification tasks, with more flexible selection criteria. The Adaptive Prediction Interval (AdaPI) framework provides robust uncertainty quantification by adaptively adjusting prediction intervals based on historical coverage, ensuring reliability in streaming regression. To evaluate prediction intervals holistically, the thesis introduces Coverage Interval Width in Non-dominated Groups (CING), a multi-objective evaluation method balancing interval width and coverage. To support the analysis of the proposed regression methods, this thesis also contributes the New Zealand Energy Pricing (NZEP) datasets, a comprehensive repository for real-time energy analytics. NZEP aims to provide a real, growing, customizable regression data source that can enrich current regression benchmark data for stream learning and, potentially, time-series analysis.
By providing scalable, adaptive solutions for regression and classification, this research advances real-time decision-making in streaming data environments.
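
As an illustration of the general idea of adjusting prediction intervals from historical coverage, the following is a minimal Python sketch. It is not the AdaPI algorithm from the thesis: the class name `CoverageAdjustedInterval`, the sliding window of recent outcomes, the multiplicative `step`, and the `base_half_width` input are all assumptions made for this example. The sketch widens the interval when recent coverage falls below a target and narrows it when coverage exceeds the target.

```python
from collections import deque


class CoverageAdjustedInterval:
    """Hypothetical illustration only; not the thesis's AdaPI implementation."""

    def __init__(self, target_coverage=0.9, window=100, step=0.05):
        self.target_coverage = target_coverage
        # Assumption: recent coverage is tracked over a bounded window of outcomes.
        self.hits = deque(maxlen=window)
        self.step = step
        self.scale = 1.0  # multiplier applied to the base half-width

    def interval(self, prediction, base_half_width):
        """Return (lower, upper) around a point prediction."""
        half = self.scale * base_half_width
        return prediction - half, prediction + half

    def update(self, lower, upper, y_true):
        """Record whether y_true was covered, then rescale the interval."""
        self.hits.append(1 if lower <= y_true <= upper else 0)
        coverage = sum(self.hits) / len(self.hits)
        if coverage < self.target_coverage:
            self.scale *= 1 + self.step  # under-covering: widen
        elif coverage > self.target_coverage:
            self.scale *= 1 - self.step  # over-covering: narrow
        return coverage
```

In a stream, one would call `interval` before the true value arrives and `update` once it is observed, so the width follows the observed coverage. The trade-off this creates between width and coverage is the kind of balance that a multi-objective evaluation such as CING is described as assessing.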

Publisher

The University of Waikato

Type of thesis