Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

Mayuka Jayawardhana; Renbo; Samuel Dooley; Valeriia Cherepanova; Andrew Gordon Wilson; Frank Hutter; Colin White; Tom Goldstein; Micah Goldblum

arXiv:2502.02672·cs.CL·February 7, 2025

TL;DR

This paper introduces LLM-Boost and PFN-Boost, simple fusion methods that combine large language models and TabPFN with gradient-boosted decision trees, improving performance across various dataset sizes on tabular data.

Contribution

The paper presents a novel fusion approach that leverages the strengths of transformers and GBDTs, achieving state-of-the-art results on tabular datasets of varying sizes.

Findings

01 PFN-Boost achieves the best average performance across datasets.

02 The fusion methods outperform standalone models on intermediate dataset sizes.

03 The fusion methods achieve state-of-the-art results against multiple baselines and ensembling methods.

Abstract

Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and…
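
The abstract above is cut off before it describes the fusion mechanism, so the following is only a minimal sketch of one way a pretrained model's predictions could be combined with boosted trees. It assumes (this is an assumption, not something stated in the text above) that boosting starts from the pretrained model's scaled logits instead of a constant, so the trees only have to learn a correction. The names `fit_seeded_boosting`, `prior_logits`, and `scale` are hypothetical, the trees are depth-1 stumps, and the toy data is made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Least-squares depth-1 tree: returns (feature, threshold, left_value, right_value)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant leaf
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value

def fit_seeded_boosting(X, y, prior_logits, scale=1.0, n_rounds=20, lr=0.5):
    """Gradient boosting for binary labels, seeded with another model's logits.

    ASSUMPTION: `prior_logits` stands in for the output of a pretrained
    transformer (for example an LLM or TabPFN); `scale` controls how much
    weight that prior gets. With scale=0 this is plain boosting from zero.
    """
    scores = [scale * p for p in prior_logits]
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the logistic loss with respect to the scores.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + lr * stump_predict(stump, row) for s, row in zip(scores, X)]
    return stumps

def predict_proba(stumps, row, prior_logit, scale=1.0, lr=0.5):
    score = scale * prior_logit + lr * sum(stump_predict(s, row) for s in stumps)
    return sigmoid(score)

# Toy data (hypothetical): one feature, and a weak prior that is wrong on rows 0 and 2.
X = [[0.1], [0.4], [0.6], [0.9]]
y = [0, 0, 1, 1]
prior_logits = [0.5, -0.5, -0.2, 0.3]

stumps = fit_seeded_boosting(X, y, prior_logits)
probs = [predict_proba(stumps, row, p) for row, p in zip(X, prior_logits)]
```

With `n_rounds=0` the prediction reduces to the prior alone, and each boosting round moves the scores toward the labels. A scaling knob like this would let the prior dominate on small datasets and the trees dominate as more rows become available, which matches the motivation in the abstract; whether the paper tunes it this way is not stated in the text above.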

Code & Models

Repositories

mayukaj/llm-boost (Official)

Taxonomy

Topics: Machine Learning and Data Classification

Methods: tabular data, Prior-data Fitted Network