Large Scale Transfer Learning for Tabular Data via Language Modeling
Josh Gardner, Juan C. Perdomo, Ludwig Schmidt

TL;DR
This paper introduces TabuLa-8B, a Llama 3-8B language model fine-tuned on over 2.1B rows from over 4M unique tables, and reports improved zero-shot and few-shot prediction accuracy on unseen tables.
Contribution
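It presents a process for extracting and filtering a large, high-quality training dataset from the TabLib corpus, together with a fine-tuning scheme that adapts an LLM to tabular prediction (classification and binned regression), narrowing the transfer learning gap for tabular data.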
Findings
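TabuLa-8B's zero-shot accuracy on unseen tables is over 15 percentage points (pp) higher than random guessing, a setting in which existing tabular models such as XGBoost and TabPFN cannot make predictions at all.
In few-shot settings (1-32 shots), TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models trained on equal, or up to 16x more, data.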
The model and dataset are publicly released for further research.
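To make the zero-shot and few-shot settings concrete, here is a minimal sketch of how table rows might be turned into a text prompt for a language model. The serialization format and the names `serialize_row` and `build_prompt` are illustrative assumptions; this page does not describe TabuLa-8B's actual prompt format.

```python
def serialize_row(row, target):
    """Render every non-target column as 'column is value' (hypothetical format)."""
    return "; ".join(f"{col} is {val}" for col, val in row.items() if col != target)


def build_prompt(shots, query, target, choices):
    """Build a k-shot prompt; an empty `shots` list gives the zero-shot case."""
    options = " or ".join(str(c) for c in choices)
    lines = []
    for example in shots:
        # Labeled examples show the model the task before the query row.
        lines.append(
            f"{serialize_row(example, target)}. "
            f"Is {target} {options}? Answer: {example[target]}"
        )
    # The query row ends with an open 'Answer:' for the model to complete.
    lines.append(f"{serialize_row(query, target)}. Is {target} {options}? Answer:")
    return "\n".join(lines)


shots = [
    {"age": 34, "occupation": "nurse", "income": "low"},
    {"age": 51, "occupation": "engineer", "income": "high"},
]
query = {"age": 45, "occupation": "teacher", "income": "low"}

zero_shot_prompt = build_prompt([], query, "income", ["low", "high"])
few_shot_prompt = build_prompt(shots, query, "income", ["low", "high"])
print(few_shot_prompt)
```

In this sketch the only difference between zero-shot and few-shot prediction is whether labeled rows from the target table are placed in the prompt; no model weights are updated in either case.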
Abstract
Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 2.1B rows from over 4M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and…
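The abstract mentions "binned regression", in which a continuous target is discretized so that regression can be handled as classification. A minimal sketch of one way to do this is shown below, assuming quantile-based bins. The number of bins, the quantile strategy, and the helper names are assumptions for illustration, not details confirmed by the excerpt above.

```python
from bisect import bisect_right
from statistics import quantiles


def make_bin_edges(values, n_bins=4):
    """Return n_bins - 1 interior cut points at the empirical quantiles."""
    return quantiles(values, n=n_bins)


def to_bin_label(value, edges):
    """Map a continuous value to a bin index in the range 0..len(edges)."""
    return bisect_right(edges, value)


def describe_bin(index, edges):
    """Render a bin index as a readable class label (hypothetical wording)."""
    if index == 0:
        return f"less than {edges[0]}"
    if index == len(edges):
        return f"at least {edges[-1]}"
    return f"at least {edges[index - 1]} and less than {edges[index]}"


prices = [1, 2, 3, 4, 5, 6, 7, 8]
edges = make_bin_edges(prices, n_bins=4)
labels = [to_bin_label(p, edges) for p in prices]
class_names = [describe_bin(i, edges) for i in range(len(edges) + 1)]
print(labels)
print(class_names)
```

Once the target is a bin index, the same classification machinery (and the same accuracy metric) applies to both task types, at the cost of losing resolution inside each bin.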
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · LLaMA · tabular data Prior-data Fitted Network
