Forgetful Forests: high performance learning data structures for streaming data under concept drift

Zhehu Yuan; Yinqi Sun; Dennis Shasha

arXiv:2212.07876·cs.LG·December 16, 2022

TL;DR

This paper introduces 'Forgetful Forests', tree-based learning data structures for streaming data that adapt to concept drift, running up to 24 times faster than state-of-the-art incremental algorithms with at most a 2% loss of accuracy.

Contribution

It presents 'forgetful' tree-based learning algorithms that combine incremental computation with sequential and probabilistic filtering to cope with concept drift.

Findings

01

Up to 24 times faster than state-of-the-art incremental algorithms

02

At most a 2% loss of accuracy at the highest speedup, or at least twice as fast with no loss of accuracy

03

Suitable for high-volume streaming data applications

Abstract

Database research can help machine learning performance in many ways. One way is to design better data structures. This paper combines incremental computation with sequential and probabilistic filtering to enable "forgetful" tree-based learning algorithms to cope with data subject to concept drift (i.e., data whose function from input to classification changes over time). The forgetful algorithms described in this paper achieve high time performance while maintaining high-quality predictions on streaming data. Specifically, the algorithms are up to 24 times faster than state-of-the-art incremental algorithms with at most a 2% loss of accuracy, or at least twice as fast without any loss of accuracy. This makes such structures suitable for high-volume streaming applications.
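To make the idea of forgetting under concept drift concrete, here is a minimal sketch. It is not the paper's algorithm: the class `ForgetfulTable`, its fixed-size `window` parameter, and the lookup-table model are all hypothetical and stand in for the tree-based structures the abstract describes. It only illustrates two ingredients named above: counts are updated incrementally as each sample arrives, and the oldest sample is discarded so that a changed input-to-label mapping can take over.

```python
from collections import Counter, defaultdict, deque


class ForgetfulTable:
    """Toy learner (illustrative assumption, not the paper's method).

    Predicts the most frequent label seen for an input value among
    the most recent `window` samples only.
    """

    def __init__(self, window):
        self.window = window                  # number of samples to retain
        self.recent = deque()                 # retained (x, y) pairs, oldest first
        self.counts = defaultdict(Counter)    # x -> label counts within the window

    def update(self, x, y):
        # Incremental step: fold the new sample into the running counts.
        self.recent.append((x, y))
        self.counts[x][y] += 1
        # Forgetting step: drop the oldest sample once the window is exceeded.
        if len(self.recent) > self.window:
            old_x, old_y = self.recent.popleft()
            self.counts[old_x][old_y] -= 1
            if self.counts[old_x][old_y] == 0:
                del self.counts[old_x][old_y]

    def predict(self, x):
        # Return None when no retained sample has this input value.
        seen = self.counts.get(x)
        if not seen:
            return None
        return seen.most_common(1)[0][0]
```

With `window=4`, feeding ten samples `(0, "a")` followed by four samples `(0, "b")` leaves only the `"b"` samples in the window, so `predict(0)` switches to `"b"`. A learner that never forgets would still favor `"a"` by ten to four. Each update costs constant time because the counts are adjusted in place rather than recomputed from the retained samples.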

Taxonomy

Topics: Data Stream Mining Techniques · Machine Learning and Data Classification · Air Quality Monitoring and Forecasting