Why do tree-based models still outperform deep learning on tabular data?
Léo Grinsztajn (SODA), Edouard Oyallon (ISIR, CNRS), Gaël Varoquaux (SODA)

TL;DR
Tree-based models like XGBoost and Random Forests still outperform deep learning methods on tabular data, especially for medium-sized datasets, due to their favorable inductive biases and robustness.
Contribution
The paper benchmarks deep learning and tree-based models on a standardized collection of 45 tabular datasets and investigates the reasons behind their performance gap.
Findings
Tree-based models outperform deep learning on medium-sized tabular data (around 10K samples).
Deep learning models struggle with uninformative features and with irregular (non-smooth) target functions.
A new benchmark of 45 datasets is provided for future research.
Abstract
While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data (~10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide…
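The abstract mentions a methodology that accounts for both fitting models and finding good hyperparameters. One way to make tuning cost visible is to report, for each search budget n, the test score of the configuration that had the best validation score among the first n random-search trials, averaged over several shuffles of the trial order. The sketch below illustrates that idea only. It is not the authors' code: the function names, the number of shuffles, and the example scores are all hypothetical.

```python
import random
from statistics import mean


def best_so_far_curve(trials):
    """Test score of the best-on-validation trial after each search step.

    `trials` is a list of (validation_score, test_score) pairs in the
    order the hyperparameter configurations were evaluated.
    Higher scores are assumed to be better.
    """
    curve = []
    best_val, best_test = float("-inf"), None
    for val_score, test_score in trials:
        # Select on validation only; the test score is just reported.
        if val_score > best_val:
            best_val, best_test = val_score, test_score
        curve.append(best_test)
    return curve


def averaged_curve(trials, n_shuffles=15, seed=0):
    """Average the best-so-far curve over random orderings of the trials.

    Shuffling removes the dependence on the particular order in which
    the random search happened to draw its configurations.
    """
    rng = random.Random(seed)
    curves = []
    for _ in range(n_shuffles):
        order = list(trials)
        rng.shuffle(order)
        curves.append(best_so_far_curve(order))
    # zip(*curves) groups the values of all curves at the same budget.
    return [mean(step) for step in zip(*curves)]


# Hypothetical (validation, test) scores for four random-search trials.
example_trials = [(0.70, 0.68), (0.80, 0.75), (0.75, 0.79), (0.60, 0.61)]
example_curve = best_so_far_curve(example_trials)
example_average = averaged_curve(example_trials, n_shuffles=5, seed=1)
```

Comparing such curves for two model families shows not only which one wins after a long search, but also which one reaches a good score with few tuning iterations. Note that selection uses the validation score alone, so a trial with a higher test score (0.79 above) is not picked when its validation score is lower.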
Taxonomy
Topics: Big Data and Digital Economy · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
