Large Scale Transfer Learning for Tabular Data via Language Modeling

Josh Gardner; Juan C. Perdomo; Ludwig Schmidt

arXiv:2406.12031 · cs.LG · November 22, 2024 · 1 citation

TL;DR

This paper introduces TabuLa-8B, a Llama 3-8B language model fine-tuned on over 2.1B rows from over 4M unique tables drawn from the TabLib corpus, and reports improved zero-shot and few-shot accuracy on tabular prediction tasks (classification and binned regression).

Contribution

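The paper defines a process for extracting a large, high-quality training dataset from the TabLib corpus, including methods for table filtering and quality control, and a fine-tuning scheme that adapts an LLM to tabular prediction. Together these narrow the transfer learning gap between tabular data and domains such as language and vision.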

Findings

01

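TabuLa-8B's zero-shot accuracy on unseen tables is over 15 percentage points (pp) higher than random guessing, a setting that existing tabular predictors such as XGBoost and TabPFN cannot handle at all.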

02

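In few-shot settings (1-32 shots), without fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models trained on equal or up to 16x more data.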

03

The model and dataset are publicly released for further research.

Abstract

Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 2.1B rows from over 4M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and…
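The abstract mentions two ideas that a small example can make concrete: treating regression as classification over bins, and presenting table rows to a language model as text. The sketch below is illustrative only and is not the authors' implementation. The quantile binning rule, the "column is value" text template, and the helper names `make_bins`, `bin_label`, `serialize_row`, and `few_shot_prompt` are all assumptions made for this example.

```python
from bisect import bisect_right
from statistics import quantiles


def make_bins(values, n_bins=4):
    """Return the n_bins - 1 quantile cut points of `values` (assumed binning rule)."""
    return quantiles(values, n=n_bins)


def bin_label(value, cuts):
    """Map a continuous value to the index of the bin it falls into."""
    return bisect_right(cuts, value)


def serialize_row(row, target_name, choices):
    """Render one row as text (hypothetical template, not the paper's format)."""
    features = ", ".join(f"{k} is {v}" for k, v in row.items())
    options = " or ".join(str(c) for c in choices)
    return f"{features}. What is {target_name}? Options: {options}. Answer:"


def few_shot_prompt(shots, query, target_name, choices):
    """Build a prompt from labeled (row, label) shots followed by an unlabeled query row."""
    blocks = [f"{serialize_row(r, target_name, choices)} {y}" for r, y in shots]
    blocks.append(serialize_row(query, target_name, choices))
    return "\n".join(blocks)


# Toy data: house prices become bin indices, so regression turns into classification.
prices = [120, 150, 180, 210, 260, 300, 340, 400]
cuts = make_bins(prices, n_bins=4)
choices = list(range(len(cuts) + 1))  # one class per bin

shots = [({"rooms": 3, "city": "Leeds"}, bin_label(150, cuts))]
query = {"rooms": 5, "city": "Bristol"}
print(few_shot_prompt(shots, query, "price_bin", choices))
```

In this sketch, a zero-shot prompt is simply the case where `shots` is empty, and a model's answer would be read off as one of the listed bin indices.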

Code & Models

Videos

Large Scale Transfer Learning for Tabular Data via Language Modeling · slideslive

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · LLaMA · tabular data Prior-data Fitted Network