Approximating splits for decision trees quickly in sparse data streams

Nikolaj Tatti

arXiv:2601.12525·cs.LG·January 21, 2026

Open Access

TL;DR

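This paper introduces a fast approximation algorithm for finding near-optimal splits, measured by information gain or Gini index, when learning decision trees from sparse binary data streams. It avoids the $O(d)$ scan over all $d$ features that an exact search requires while keeping split quality within a provable factor of the optimum.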

Contribution

The paper presents a novel algorithm that approximates optimal decision tree splits efficiently for sparse binary data streams, improving speed over existing methods.

Findings

01

Achieves a $(1 + \alpha)$ approximation for information gain in amortized $O(\alpha^{-1}(1 + m \log d) \log \log n)$ time.

02

Achieves a $(1 + \alpha)$ approximation for Gini index in amortized $O(\alpha^{-1} + m \log d)$ time.

03

Outperforms baseline methods in experiments, providing faster and nearly optimal splits.

Abstract

Decision trees are one of the most popular classifiers in the machine learning literature. While the most common decision tree learning algorithms treat data as a batch, numerous algorithms have been proposed to construct decision trees from a data stream. A standard training strategy involves augmenting the current tree by changing a leaf node into a split. Here we typically maintain counters in each leaf, which allow us to determine the optimal split and whether the split should be done. In this paper we focus on how to speed up the search for the optimal split when dealing with sparse binary features and a binary class. We focus on finding splits with approximately optimal information gain or Gini index. In both cases finding the optimal split can be done in $O(d)$ time, where $d$ is the number of features. We propose an algorithm that yields $(1 + \alpha)$ approximation…
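To make the baseline concrete, here is a minimal Python sketch of the exact $O(d)$ search that the abstract describes, assuming per-leaf counters over sparse binary features and a binary class. It is not the paper's approximation algorithm. The names `LeafCounters`, `split_cost`, and `best_split_exact` are hypothetical and chosen only for illustration. Minimizing the weighted impurity of the two children is equivalent to maximizing information gain (with `entropy`) or the Gini gain (with `gini`), since the parent's impurity is the same for every candidate feature.

```python
import math
from collections import defaultdict


class LeafCounters:
    """Hypothetical per-leaf counts for sparse binary features and a binary class."""

    def __init__(self):
        self.n = [0, 0]  # n[c]: points seen so far with class c
        # ones[j][c]: points seen with feature j equal to 1 and class c
        self.ones = defaultdict(lambda: [0, 0])

    def update(self, active_features, label):
        """Add one data point, given only the indices of its 1-valued features."""
        self.n[label] += 1
        for j in active_features:  # touches only the nonzero features
            self.ones[j][label] += 1


def entropy(a, b):
    """Binary entropy in bits of a node holding a and b points of each class."""
    total = a + b
    h = 0.0
    for k in (a, b):
        if k > 0:
            h -= (k / total) * math.log2(k / total)
    return h


def gini(a, b):
    """Gini impurity of a node holding a and b points of each class."""
    total = a + b
    if total == 0:
        return 0.0
    p = a / total
    return 2 * p * (1 - p)


def split_cost(counters, j, impurity):
    """Weighted child impurity when splitting on feature j (assumes >= 1 point seen)."""
    n0, n1 = counters.n
    a0, a1 = counters.ones.get(j, (0, 0))  # .get avoids creating empty entries
    total = n0 + n1
    ones_side = (a0 + a1) / total * impurity(a0, a1)
    zeros_side = (total - a0 - a1) / total * impurity(n0 - a0, n1 - a1)
    return ones_side + zeros_side


def best_split_exact(counters, d, impurity):
    """Exact baseline: scan all d features and return the one with the lowest cost."""
    return min(range(d), key=lambda j: split_cost(counters, j, impurity))


# Usage: a tiny stream over d = 4 features, each point given as (active indices, label).
stream = [({0, 2}, 1), ({0}, 1), ({1}, 0), (set(), 0), ({0, 3}, 1), ({1, 3}, 0)]
leaf = LeafCounters()
for active, label in stream:
    leaf.update(active, label)
best_by_entropy = best_split_exact(leaf, 4, entropy)
best_by_gini = best_split_exact(leaf, 4, gini)
```

In this sketch an update costs time proportional to the number of 1-valued features in the point, but every call to `best_split_exact` still scans all `d` features. That per-query scan is the cost the paper aims to reduce for sparse data.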

Taxonomy

Topics: Data Stream Mining Techniques · Imbalanced Data Classification Techniques · Machine Learning and Algorithms