When Do Neural Nets Outperform Boosted Trees on Tabular Data?

Duncan McElfresh; Sujay Khandagale; Jonathan Valverde; Vishak Prasad C; Benjamin Feuer; Chinmay Hegde; Ganesh Ramakrishnan; Micah Goldblum; Colin White

arXiv:2305.02997·cs.LG·July 17, 2024·71 cites

TL;DR

This paper conducts the largest comparison of neural networks and gradient-boosted decision trees on tabular data, revealing that the debate over which performs better is overemphasized and providing insights into dataset properties influencing model choice.

Contribution

The paper provides a large-scale analysis comparing 19 algorithms across 176 datasets, introduces the TabZilla Benchmark Suite, and offers practical guidelines for model selection based on dataset characteristics.

Findings

01. On many datasets, GBDTs and NNs perform similarly, and the performance difference between them is negligible.

02. Light hyperparameter tuning on a GBDT can matter more than the choice between NNs and GBDTs.

03. TabPFN outperforms other algorithms on average for small training sets.
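The first two findings describe a per-dataset comparison: is the gap between model families negligible, and if not, is it smaller than the gain from tuning a GBDT? A minimal sketch of that bookkeeping follows. It is not the paper's evaluation code; the `DatasetResult` fields, the `categorize` helper, the `tolerance` cutoff, and all scores are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DatasetResult:
    """Test accuracy on one dataset. All values used below are made up."""
    name: str
    gbdt_default: float  # GBDT with default hyperparameters
    gbdt_tuned: float    # GBDT after light hyperparameter tuning
    nn_best: float       # best-performing neural net


def categorize(result: DatasetResult, tolerance: float = 0.01) -> str:
    """Label which effect dominates on a single dataset.

    `tolerance` is an assumed cutoff for "negligible"; it is not a
    threshold taken from the paper.
    """
    # Gap between the two model families (tuned GBDT vs. best NN).
    family_gap = abs(result.gbdt_tuned - result.nn_best)
    # Gain obtained purely by tuning the GBDT.
    tuning_gain = result.gbdt_tuned - result.gbdt_default

    if family_gap <= tolerance:
        return "negligible_gap"
    if tuning_gain >= family_gap:
        return "tuning_matters_more"
    return "family_matters_more"


# Hypothetical scores for three toy datasets (not real benchmark results).
results = [
    DatasetResult("toy_a", gbdt_default=0.90, gbdt_tuned=0.91, nn_best=0.905),
    DatasetResult("toy_b", gbdt_default=0.80, gbdt_tuned=0.86, nn_best=0.83),
    DatasetResult("toy_c", gbdt_default=0.70, gbdt_tuned=0.71, nn_best=0.80),
]

# Tally how many datasets fall into each category.
counts = Counter(categorize(r) for r in results)

for r in results:
    print(f"{r.name}: {categorize(r)}")
```

With real benchmark scores in place of the toy values, the share of datasets in the first two categories would indicate how often the choice between NNs and GBDTs is the less important decision.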

Abstract

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the…

Code & Models

Videos

When Do Neural Nets Outperform Boosted Trees on Tabular Data? · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Imbalanced Data Classification Techniques