Can a Single Tree Outperform an Entire Forest?
Qiangqiang Mao, Yankai Cao

TL;DR
This paper presents a novel gradient-based optimization framework that significantly enhances the testing accuracy of a single oblique decision tree, making it competitive with random forests while maintaining interpretability.
Contribution
The paper introduces a differentiable optimization approach to tree training, combining a scaled sigmoid approximation with a subtree polish strategy, to improve the testing accuracy of a single tree.
Findings
The optimized tree outperforms the classic random forest by 2.03% on average.
The approach achieves comparable accuracy to ensemble methods.
Extensive experiments validate the effectiveness across 16 datasets.
Abstract
The prevailing mindset is that a single decision tree underperforms classic random forests in testing accuracy, despite its advantages in interpretability and lightweight structure. This study challenges such a mindset by significantly improving the testing accuracy of an oblique regression tree through our gradient-based entire tree optimization framework, making its performance comparable to the classic random forest. Our approach reformulates tree training as a differentiable unconstrained optimization task, employing a scaled sigmoid approximation strategy. To ameliorate numerical instability, we propose an algorithmic scheme that solves a sequence of increasingly accurate approximations. Additionally, a subtree polish strategy is implemented to reduce approximation errors accumulated across the tree. Extensive experiments on 16 datasets demonstrate that our optimized tree…
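The scaled sigmoid idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows, for a hypothetical depth-1 oblique regression tree, how a sigmoid with scale factor `alpha` stands in for the hard split indicator, and how the soft prediction approaches the hard one as `alpha` grows. The names `scaled_sigmoid`, `soft_oblique_predict`, `hard_oblique_predict`, and all numeric values are assumptions made for illustration; gradient-based training and the subtree polish step are not shown.

```python
import math


def scaled_sigmoid(z, alpha):
    """Smooth stand-in for the step function 1[z > 0]; sharper as alpha grows."""
    t = alpha * z
    # Branch on the sign so that math.exp never receives a large positive argument.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def split_score(x, w, b):
    """Oblique split: a linear combination of all features, w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_oblique_predict(x, w, b, left_value, right_value, alpha):
    """Soft depth-1 tree: blend the two leaf values by the routing weight."""
    p_right = scaled_sigmoid(split_score(x, w, b), alpha)
    return (1.0 - p_right) * left_value + p_right * right_value


def hard_oblique_predict(x, w, b, left_value, right_value):
    """Hard depth-1 tree: send the sample to exactly one leaf."""
    return right_value if split_score(x, w, b) > 0 else left_value


# Hypothetical sample and split parameters (illustrative values only).
x = [0.5, -0.2]
w = [1.0, 2.0]
b = -0.05
left_value, right_value = 1.0, 3.0

hard = hard_oblique_predict(x, w, b, left_value, right_value)

# A sequence of increasingly accurate approximations: larger alpha, smaller gap.
alphas = [1, 10, 100, 1000]
gaps = [
    abs(soft_oblique_predict(x, w, b, left_value, right_value, a) - hard)
    for a in alphas
]
for a, g in zip(alphas, gaps):
    print(f"alpha={a:5d}  |soft - hard| = {g:.6f}")
```

A small `alpha` gives a smooth objective with informative gradients but a poor match to the hard tree, while a large `alpha` matches the hard tree closely but flattens the gradient away from the split boundary. That trade-off is one plausible reading of why the abstract describes solving a sequence of approximations rather than a single sharp one.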
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. Your paper is commendably clear and easy to understand.
1. The title of the paper is ambiguous. A comparison between a tree and a random forest needs stated conditions. Random forest is essentially an ensemble learning framework in which decision trees are the base learners. Does your title mean that ensemble learning frameworks cannot work with the tree model you proposed? If you are comparing the proposed tree model with the original version of RF, the comparison carries little significance. 2. In my opinion, this article should
1. The authors attempt to address the important problem of optimizing hard decision trees, which is an NP-hard problem. 2. The method is evaluated across 16 datasets.
1. The use of simulated annealing for training soft decision trees lacks novelty (e.g., [1], [2]). 2. The "accuracy" metric reported throughout the paper is not formally defined. 3. The training time appears exponential in the number of parameters, because the soft branches route the entire dataset through all decision nodes ($2^{D + 1} - 1$). This is further exacerbated by the Polish Strategy, which applies the algorithm to all subtrees. 4. It is unclear how a complete binary oblique decision tre
- Clear, easy-to-follow algorithm with a practical implementation; - Experiments show convincing results against several baselines (including, surprisingly, random forests), although the scale of the datasets is a concern (see below);
1. Novelty: I believe the paper combines several ideas explored in the soft DT literature: - Gradient-based learning via sigmoid approximation is a well-known approach to training soft trees; - Based on my understanding, iterative scaled sigmoid approximation is similar to an annealing mechanism, which has been explored previously, e.g. in [1] (although this is not the earliest work). Note that [1] also discusses an alternative function to the sigmoid. - Subtree polishing resembles a weaker version of Tree Alt
Taxonomy
Topics: Data Visualization and Analytics · Data Management and Algorithms · Polynomial and algebraic computation
