TabDPT: Scaling Tabular Foundation Models on Real Data

Junwei Ma; Valentin Thomas; Rasa Hosseinzadeh; Alex Labach; Hamidreza Kamkari; Jesse C. Cresswell; Keyvan Golestan; Guangwei Yu; Anthony L. Caterini; Maksims Volkovs

arXiv:2410.18164 · cs.LG · January 21, 2026 · 2 cites

TL;DR

This paper introduces TabDPT, a scalable tabular foundation model trained with real data and in-context learning, demonstrating improved generalization and performance on various benchmarks, and establishing scaling laws for tabular models.

Contribution

It proposes a novel approach combining ICL-based retrieval with self-supervised learning for tabular models, emphasizing the importance of real data in pre-training.
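As a rough illustration of the two ingredients named above, the sketch below shows (a) retrieving a small context of nearby training rows for a query row, and (b) building a self-supervised task from an unlabeled table by hiding one column and treating it as the target. Both functions are hypothetical helpers written for this summary, not code from the TabDPT repository; the Euclidean nearest-neighbour retrieval and the hide-one-column setup are assumptions about how such a scheme could look.

```python
import numpy as np

def retrieve_context(X_train, y_train, x_query, k=3):
    """Return the k training rows closest to x_query (Euclidean distance).

    Assumption: plain nearest-neighbour retrieval; the paper's actual
    retrieval mechanism may differ.
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dists)[:k]  # indices of the k smallest distances
    return X_train[idx], y_train[idx]

def make_ssl_task(table, target_col):
    """Turn an unlabeled table into a supervised task by hiding one column.

    Assumption: the hidden column becomes the prediction target and the
    remaining columns become the features.
    """
    y = table[:, target_col]
    X = np.delete(table, target_col, axis=1)
    return X, y

# Toy data, for illustration only.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
ctx_X, ctx_y = retrieve_context(X_train, y_train, np.array([5.5, 5.4]), k=2)

table = np.arange(12, dtype=float).reshape(4, 3)
ssl_X, ssl_y = make_ssl_task(table, target_col=1)
```

In an ICL setting, the retrieved rows and their labels would be passed to the model as context alongside the query row, so no weights are updated at prediction time.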

Findings

01 Real data enhances pre-training effectiveness.

02 Scaling model and data size improves performance following power laws.

03 TabDPT achieves state-of-the-art results on regression and classification benchmarks.
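To make the power-law finding concrete, here is a minimal sketch of how a scaling relationship of the form loss ≈ a · N^(-b) can be fitted: taking logarithms turns it into a straight line, so an ordinary least-squares fit in log-log space recovers a and b. The functional form, the function name `fit_power_law`, and the numbers are illustrative assumptions; the data is synthetic and does not come from the paper.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss ≈ a * size**(-b) via least squares in log-log space.

    log(loss) = log(a) - b * log(size), so the slope is -b and the
    intercept is log(a).
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return np.exp(intercept), -slope

# Synthetic, illustrative numbers only (not results from the paper).
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 2.0 * sizes ** -0.1
a, b = fit_power_law(sizes, losses)
```

On real measurements the points would scatter around the line, and the fitted exponent b summarizes how quickly loss falls as model or data size grows.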

Abstract

Tabular data is one of the most ubiquitous sources of information worldwide, spanning a wide variety of domains. This inherent heterogeneity has slowed the development of Tabular Foundation Models (TFMs) capable of fast generalization to unseen datasets. In-Context Learning (ICL) has recently emerged as a promising solution for TFMs, enabling dynamic adaptation to new tasks without additional tuning. While many studies have attempted to repurpose large language models for tabular ICL, they have had limited success, so recent works have focused on developing tabular-specific foundation models. In this work, we propose an approach to combine ICL-based retrieval with self-supervised learning to train tabular foundation models. We also investigate the utility of real vs. synthetic data for model pre-training, and show that real data can contain useful signal not easily captured in…

Code & Models

Repositories

layer6ai-labs/TabDPT (PyTorch, official)

Taxonomy

Topics: Error Correcting Code Techniques

Methods: Linear Layer · Dense Connections · Multi-Head Attention · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding · Layer Normalization