Why do tree-based models still outperform deep learning on tabular data?

Léo Grinsztajn (SODA); Edouard Oyallon (ISIR; CNRS); Gaël Varoquaux (SODA)

arXiv:2207.08815·cs.LG·July 20, 2022·142 cites

TL;DR

Tree-based models like XGBoost and Random Forests still outperform deep learning methods on tabular data, especially for medium-sized datasets, due to their favorable inductive biases and robustness.

Contribution

The paper provides extensive benchmarking of deep learning and tree-based models on a large, standardized tabular dataset collection, and investigates reasons behind their performance gap.

Findings

01

Tree-based models outperform deep learning on medium-sized tabular data.

02

Deep learning models struggle with uninformative features and irregular functions.

03

A standardized benchmark of 45 tabular datasets is provided for future research.
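To make the second finding concrete, here is a minimal sketch of one way to probe sensitivity to uninformative features: append columns of pure noise to a feature matrix, then refit each model and compare scores. The helper name `add_uninformative_features` and the choice of standard Gaussian noise are assumptions made for illustration; this summary does not say how the paper constructs such features.

```python
import random


def add_uninformative_features(rows, n_extra, seed=0):
    """Return a copy of `rows` with `n_extra` noise columns appended.

    Hypothetical helper: the noise is drawn from a standard Gaussian,
    independently of the original features and of any target, so the
    added columns carry no predictive signal.
    """
    rng = random.Random(seed)
    return [
        list(row) + [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
        for row in rows
    ]


# Toy feature matrix: 2 samples, 2 informative features.
X = [[1.0, 2.0], [3.0, 4.0]]
X_noisy = add_uninformative_features(X, n_extra=3)
```

Training the same model on `X` and on `X_noisy` and comparing held-out scores gives a rough measure of how much the model is hurt by features that carry no signal.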

Abstract

While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data (~10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide…
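A benchmarking methodology that accounts for "finding good hyperparameters" can be read as reporting performance as a function of the tuning budget. The following is a sketch under that assumption, not the paper's actual protocol: given the validation and test scores of a set of random-search draws, it computes the test score of the best-on-validation configuration after each number of draws, averaged over random reorderings of the draws. The function name, the shuffle count, and the toy scores are all hypothetical.

```python
import random


def best_test_score_by_budget(trials, n_shuffles=100, seed=0):
    """Average test score of the best-on-validation config per budget.

    `trials` is a list of (validation_score, test_score) pairs, one per
    random-search draw (higher is better). Entry k-1 of the result is the
    test score of the draw with the highest validation score among the
    first k draws, averaged over `n_shuffles` random orderings.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(trials)
    for _ in range(n_shuffles):
        order = trials[:]
        rng.shuffle(order)
        best_val, best_test = float("-inf"), None
        for k, (val, test) in enumerate(order):
            # Select on validation only; the test score is just recorded.
            if val > best_val:
                best_val, best_test = val, test
            totals[k] += best_test
    return [total / n_shuffles for total in totals]


# Made-up scores for three hyperparameter draws of a single model.
trials = [(0.70, 0.68), (0.80, 0.79), (0.75, 0.74)]
curve = best_test_score_by_budget(trials)
```

Computing one such curve per model family lets tree-based models and neural networks be compared at equal tuning budgets rather than only at their best-tuned settings.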

Taxonomy

Topics: Big Data and Digital Economy · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis