Inherently Interpretable Tree Ensemble Learning
Zebin Yang, Agus Sudjianto, Xiaoming Li, Aijun Zhang

TL;DR
This paper introduces inherently interpretable tree ensemble models using shallow trees, providing a new interpretation algorithm and strategies to improve transparency without sacrificing predictive accuracy.
Contribution
It presents a novel approach to make ensemble models inherently interpretable by using shallow trees and developing an interpretation algorithm, enhancing transparency and sometimes improving performance.
Findings
The proposed methods achieve a better balance between interpretability and accuracy.
Shallow tree ensembles can be represented as generalized additive models.
Experiments demonstrate improved interpretability with competitive predictive performance.
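The additive-model finding is easiest to see for depth-1 trees (stumps). The sketch below is a minimal illustration under stated assumptions, not the paper's algorithm: the stump tuples, thresholds, leaf values, and function names are all hypothetical. Each stump depends on a single feature, so grouping stumps by split feature and summing within each group gives one shape function per feature, and the shape functions add up to the ensemble prediction.

```python
from collections import defaultdict

# Hypothetical stumps: (feature index, threshold, left value, right value).
STUMPS = [
    (0, 0.5, -1.0, 1.0),
    (1, 2.0, 0.25, -0.25),
    (0, 1.5, -0.5, 0.5),
]


def stump_value(stump, x):
    """Output of one stump for the input vector x."""
    feature, threshold, left, right = stump
    return left if x[feature] <= threshold else right


def ensemble_predict(stumps, x):
    """Ordinary ensemble prediction: the sum of all tree outputs."""
    return sum(stump_value(s, x) for s in stumps)


def group_by_feature(stumps):
    """Collect the stumps that split on each feature."""
    groups = defaultdict(list)
    for s in stumps:
        groups[s[0]].append(s)
    return dict(groups)


def shape_function(group, value):
    """f_j(x_j): summed output of every stump that splits on feature j."""
    return sum(
        left if value <= threshold else right
        for _, threshold, left, right in group
    )


def additive_predict(stumps, x):
    """Same prediction, written as a sum of per-feature shape functions."""
    groups = group_by_feature(stumps)
    return sum(shape_function(g, x[j]) for j, g in groups.items())
```

For any input, `additive_predict(STUMPS, x)` and `ensemble_predict(STUMPS, x)` return the same value; the additive form only regroups the same terms. Trees of depth two can involve two features at once, which is where pairwise interaction terms enter.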
Abstract
Tree ensemble models like random forests and gradient boosting machines are widely used in machine learning due to their excellent predictive performance. However, a high-performance ensemble consisting of a large number of decision trees lacks sufficient transparency and explainability. In this paper, we demonstrate that when shallow decision trees are used as base learners, the ensemble learning algorithms can not only become inherently interpretable subject to an equivalent representation as the generalized additive models but also sometimes lead to better generalization performance. First, an interpretation algorithm is developed that converts the tree ensemble into the functional ANOVA representation with inherent interpretability. Second, two strategies are proposed to further enhance the model interpretability, i.e., by adding constraints in the model training stage and post-hoc…
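The abstract does not spell out the interpretation algorithm, so the following is only a sketch of what a functional ANOVA decomposition looks like for a function of two features that is constant on a grid of cells, as a depth-2 tree is. The table values, the cell weights, the independence assumption behind the product weights, and the name `fanova_2d` are all assumptions made for illustration.

```python
def fanova_2d(table, w_rows, w_cols):
    """Split table[i][j] into intercept + row effect + column effect + interaction.

    w_rows and w_cols are the cell probabilities of the two features,
    assumed independent (an assumption of this sketch).
    """
    n_r, n_c = len(table), len(table[0])
    intercept = sum(
        w_rows[i] * w_cols[j] * table[i][j]
        for i in range(n_r)
        for j in range(n_c)
    )
    # Main effect = weighted mean over the other feature, minus the intercept.
    row_main = [
        sum(w_cols[j] * table[i][j] for j in range(n_c)) - intercept
        for i in range(n_r)
    ]
    col_main = [
        sum(w_rows[i] * table[i][j] for i in range(n_r)) - intercept
        for j in range(n_c)
    ]
    # Interaction = what remains after removing intercept and main effects.
    interaction = [
        [table[i][j] - intercept - row_main[i] - col_main[j] for j in range(n_c)]
        for i in range(n_r)
    ]
    return intercept, row_main, col_main, interaction


# Hypothetical leaf values of a depth-2 tree: rows index the cell of the
# first feature, columns index the cell of the second feature.
TABLE = [[1.0, 3.0], [2.0, 8.0]]
W_ROWS = [0.5, 0.5]
W_COLS = [0.5, 0.5]

intercept, row_main, col_main, interaction = fanova_2d(TABLE, W_ROWS, W_COLS)
```

Each main effect has weighted mean zero, and each row and column of the interaction term also averages to zero under the weights, so the four pieces are identifiable and sum back to the original leaf values.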
Peer Reviews
Decision·Submitted to ICLR 2026
The general topic of interpretable models is relevant (although the paper has a focus on tabular data and tree-based models only). The authors provide clear exposition of how to map tree ensemble leaf nodes to fANOVA components. The paper can be followed easily.
I am really unsure where exactly the contributions of the paper and the proposed pipeline lie (both with respect to theoretical insights / empirical insights). Lou et al. 2013 (https://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf) already introduced GA2M models with main effects + pairwise interactions using shallow tree ensembles and showed that these models often match full-complexity models empirically. They also developed the FAST algorithm for efficient interaction detection which is more pri
The paper reads quite well; it flows logically. The connection between tree ensembles and the simple additive components is quite interesting. The discussion of interpretability-relevant parameters and the analysis of the dependence of performance on the number of components used are insightful. The method seems to perform okay.
The Numerical Results are the paper's biggest weakness. In contrast to the claim of "superior trade-off between interpretability and predictive power on synthetic and real-world datasets":
- The performance is not shown to be superior to the compared methods; they seem to (in some cases) show better performance with fewer components. Classification performance is also evaluated only with AUC.
- The real-world data results are missing from the main body.
- Statistical significance was not tested.
1. The approach of constraining tree ensembles during training to encourage inherent interpretability is interesting, particularly given the paper’s connection to the functional ANOVA perspective.
2. The observation that shallow tree ensembles correspond to low-order functional ANOVA components is practically useful and interesting, though not particularly surprising. It is also unclear how novel this insight is relative to prior work.
3. The paper is well written, clearly structured, and *ver
1. A central limitation is the lack of sufficient experimental evaluation. Relying on a single synthetic benchmark in the main text and one additional dataset in the appendix does not provide enough evidence to support the claims. In particular, more comprehensive experiments are needed, especially on benchmarks known to require higher-order interactions.
2. The empirical study only compares against GAMs and does not include comparisons to high-order GAMs or EBMs, which are highly relevant base
Taxonomy
Topics·Neural Networks and Applications
Methods·Balanced Selection
