GRANDE: Gradient-Based Decision Tree Ensembles for Tabular Data

Sascha Marton; Stefan Lüdtke; Christian Bartelt; Heiner Stuckenschmidt

arXiv:2309.17130 · cs.LG · March 13, 2024 · 2 cites

TL;DR

GRANDE introduces a gradient-based approach for training decision tree ensembles tailored for tabular data, combining axis-aligned splits with end-to-end optimization to outperform existing methods.

Contribution

It proposes a novel gradient-based decision tree ensemble method using dense representations and straight-through backpropagation, specifically designed for tabular data.

Findings

1. Outperforms existing gradient-boosting frameworks on most datasets

2. Effectively learns simple and complex relations within a single model

3. Demonstrates strong results on 19 classification datasets

Abstract

Despite the success of deep learning for text and image data, tree-based ensemble models are still state-of-the-art for machine learning with heterogeneous tabular data. However, there is a significant need for tabular-specific gradient-based methods due to their high flexibility. In this paper, we propose GRANDE, GRAdieNt-Based Decision Tree Ensembles, a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent. GRANDE is based on a dense representation of tree ensembles, which affords to use backpropagation with a straight-through operator to jointly optimize all model parameters. Our method combines axis-aligned splits, which is a useful inductive bias for tabular data, with the flexibility of gradient-based optimization. Furthermore, we introduce an advanced instance-wise weighting that…
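The straight-through idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single split on one feature, a squared-error loss, and a sigmoid surrogate for the backward pass (the paper's own split function may differ). All function names are hypothetical. The forward pass uses a hard, axis-aligned decision, while the gradient is taken from the smooth surrogate.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def hard_split_forward(x: float, threshold: float) -> float:
    """Forward pass: hard axis-aligned decision (1.0 = right, 0.0 = left)."""
    return 1.0 if x - threshold >= 0.0 else 0.0


def straight_through_grad(x: float, threshold: float) -> float:
    """Backward pass: pretend the split was sigmoid(x - threshold).

    Returns d sigmoid(x - threshold) / d threshold (assumed surrogate).
    """
    s = sigmoid(x - threshold)
    return -s * (1.0 - s)


def train_threshold(xs, targets, threshold=0.0, lr=0.5, steps=200):
    """Gradient descent on one threshold with hard forward, soft backward."""
    for _ in range(steps):
        grad = 0.0
        for x, t in zip(xs, targets):
            pred = hard_split_forward(x, threshold)
            # Chain rule for the loss (pred - t)^2, using the surrogate gradient.
            grad += 2.0 * (pred - t) * straight_through_grad(x, threshold)
        threshold -= lr * grad / len(xs)
    return threshold


if __name__ == "__main__":
    xs = [0.5, 1.0, 3.0, 3.5]
    targets = [0.0, 0.0, 1.0, 1.0]
    learned = train_threshold(xs, targets)
    print("learned threshold:", learned)
    print("predictions:", [hard_split_forward(x, learned) for x in xs])
```

In this toy setting the threshold stops moving once the hard split separates the two groups, because every error term becomes zero. The model stays a hard decision rule throughout training; only the gradient computation is relaxed.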

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper gives a detailed and clear description of the approach. The experimental evaluation and the evaluation protocol are well-defined and sound, and the results look promising.

Weaknesses

Given the popularity of gradient-based tree models in recent years, I feel like a more thorough comparison with competing methods would be warranted. In particular, relating this work to work on learning weighting for fixed tree structures would be interesting, as first discussed in "Practical Lessons from Predicting Clicks on Ads at Facebook" by He et al. "Deep Neural Decision Trees" by Yang et al. also seems relevant, as well as "Deep Neural Decision Forests" by Kontschieder et al., "SDTR: Soft D

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. This is one of the few deep-learning-based works that beat XGB on a tabular benchmark.
2. The contributions (alternative differentiable split function and instance-wise weighting) are supported by ablation experiments.
3. It provides all the hyperparameters in the appendix, which helps reproduction.

Weaknesses

1. This paper lacks further analysis of the instance-wise weighting. Because the final results are weighted by a softmax, the prediction of each tree is no longer separate: if we cut off one tree, the contributions of the other trees also change. This is different from XGB and NODE, but the authors did not point this out. Moreover, it would be better to analyze the distribution of the instance weights. For example: a) Is it long-tailed? b) Are some trees very important for most of the samples?
2. Too many
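Reviewer 02's point about softmax weighting can be made concrete with a small sketch. It assumes, purely for illustration, that each tree has one scalar score for a given instance and that the ensemble output is the softmax-weighted sum of tree outputs. The scores, outputs, and function names are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_prediction(tree_outputs, tree_scores):
    """Softmax-weighted ensemble output for one instance (assumed form)."""
    weights = softmax(tree_scores)
    return sum(w * o for w, o in zip(weights, tree_outputs))


def drop_tree(values, index):
    """Return a copy of `values` without the entry at `index`."""
    return values[:index] + values[index + 1:]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.0]    # hypothetical per-tree scores for one instance
    outputs = [1.0, -0.5, 0.3]  # hypothetical per-tree outputs

    print("weights with all trees:", softmax(scores))
    # Removing tree 0 renormalizes the weights of trees 1 and 2.
    print("weights without tree 0:", softmax(drop_tree(scores, 0)))
    print("prediction with all trees:", weighted_prediction(outputs, scores))
    print("prediction without tree 0:",
          weighted_prediction(drop_tree(outputs, 0), drop_tree(scores, 0)))
```

In a purely additive ensemble, removing one tree leaves every other tree's contribution unchanged. Here the remaining weights are renormalized to sum to one, so every remaining tree's contribution grows.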

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

Dealing with tabular data as efficiently as gradient-boosted trees do, through neural networks and gradient descent, is still an open challenge. For this very reason, proposing new, or even slightly new, models that are able to train tree ensembles in a reasonable time through gradient descent is an interesting contribution.
- The paper is clear, well written, and illustrated with several figures. I liked reading it.
- I could not manage to run the supplementary material code, but the

Weaknesses

- My major concern is about hyperparameter tuning (section C of the appendix): I understand that compute resources should be spared, but it seems unfair to optimize the number of trees for GRANDE but not for XGBoost and CatBoost, especially given the fact that XGBoost and CatBoost are the cheapest algorithms to train.
- The results of GRANDE are good on several datasets, but become less impressive when the number of features is high.
- The 2^d term in the sums suggests that the depth is a real limi
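The concern about the 2^d term can be quantified with a short sketch. It relies only on the standard fact that a complete binary tree of depth d has 2^d - 1 internal nodes and 2^d leaves. That a dense representation stores every node of such a tree is an assumption here, and the function name is hypothetical.

```python
def complete_tree_size(depth: int):
    """Return (internal_nodes, leaves) of a complete binary tree of given depth."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    leaves = 2 ** depth
    internal_nodes = leaves - 1
    return internal_nodes, leaves


if __name__ == "__main__":
    for depth in (2, 4, 6, 8, 10):
        internal_nodes, leaves = complete_tree_size(depth)
        print(f"depth={depth:2d}  internal nodes={internal_nodes:5d}  leaves={leaves:5d}")
```

Each additional level doubles the number of leaves, so any per-leaf sum grows exponentially with depth.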

Taxonomy

Topics: AI in cancer detection · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)

Methods: Gradient-Based Decision Tree Ensembles