Approximating splits for decision trees quickly in sparse data streams
Nikolaj Tatti

TL;DR
This paper introduces a fast approximation algorithm for finding near-optimal decision tree splits in sparse binary data streams, trading exact optimality for a provable approximation guarantee and a lower amortized running time.
Contribution
The paper presents a novel algorithm that approximates optimal decision tree splits efficiently for sparse binary data streams, improving speed over existing methods.
Findings
Achieves a $(1 + \eta)$ approximation for information gain in amortized $O(\eta^{-1}(1 + m \log d) \log \log n)$ time.
Achieves a $(1 + \eta)$ approximation for the Gini index in amortized $O(\eta^{-1} + m \log d)$ time.
Outperforms baseline methods in experiments, providing faster and nearly optimal splits.
Abstract
Decision trees are one of the most popular classifiers in the machine learning literature. While the most common decision tree learning algorithms treat data as a batch, numerous algorithms have been proposed to construct decision trees from a data stream. A standard training strategy involves augmenting the current tree by changing a leaf node into a split. Here we typically maintain counters in each leaf which allow us to determine the optimal split, and whether the split should be done. In this paper we focus on how to speed up the search for the optimal split when dealing with sparse binary features and a binary class. We focus on finding splits that have approximately optimal information gain or Gini index. In both cases finding the optimal split can be done in $O(d)$ time, where $d$ is the number of features. We propose an algorithm that yields approximation…
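To make the setting concrete, the following is a minimal sketch of the exact baseline the abstract describes: per-leaf counters for sparse binary features and a binary class, plus a linear scan over features for the best split. It is not the paper's approximation algorithm. The names `LeafCounters`, `best_split`, `gini`, and `entropy` are hypothetical, and the sketch assumes each data point arrives as a list of its non-zero feature indices with a label in {0, 1}.

```python
import math
from collections import defaultdict


class LeafCounters:
    """Hypothetical per-leaf counters for sparse binary features, binary class."""

    def __init__(self):
        self.n = [0, 0]                          # data points seen per class
        self.ones = defaultdict(lambda: [0, 0])  # feature -> count of 1s per class

    def update(self, active_features, label):
        """Add one data point; touches only its non-zero features."""
        self.n[label] += 1
        for j in active_features:
            self.ones[j][label] += 1


def gini(a, b):
    """Gini impurity of a node holding a points of class 0 and b of class 1."""
    t = a + b
    if t == 0:
        return 0.0
    p = a / t
    return 2.0 * p * (1.0 - p)


def entropy(a, b):
    """Binary entropy (in bits) of a node with class counts a and b."""
    t = a + b
    h = 0.0
    for c in (a, b):
        if c > 0:
            h -= (c / t) * math.log2(c / t)
    return h


def best_split(leaf, impurity):
    """Exact scan over the features seen so far.

    Returns (feature, weighted impurity after the split); the feature is None
    when no split improves on the unsplit leaf.
    """
    total = leaf.n[0] + leaf.n[1]
    best = (None, impurity(leaf.n[0], leaf.n[1]))
    for j, (a1, b1) in leaf.ones.items():
        # Counts for the "feature = 0" side follow from the leaf totals.
        a0, b0 = leaf.n[0] - a1, leaf.n[1] - b1
        score = ((a1 + b1) * impurity(a1, b1)
                 + (a0 + b0) * impurity(a0, b0)) / total
        if score < best[1]:
            best = (j, score)
    return best
```

Minimizing the weighted impurity after the split is the same as maximizing the gain, because the impurity of the unsplit leaf does not depend on the chosen feature. A feature that has never been 1 in the leaf puts every point on one side and cannot improve on the unsplit leaf, so the scan can skip it. The scan still visits up to $d$ features, which is the linear cost that the approximation algorithm aims to avoid.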
Taxonomy
Topics: Data Stream Mining Techniques · Imbalanced Data Classification Techniques · Machine Learning and Algorithms
