Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data

Anurag Garg; Muhammad Ali; Noah Hollmann; Lennart Purucker; Samuel Müller; Frank Hutter

arXiv:2507.03971 · cs.LG · July 8, 2025

TL;DR

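Real-TabPFN improves the TabPFN tabular foundation model, which is pre-trained solely on synthetic data, by adding a continued pre-training phase on a small, curated collection of large real-world datasets. This yields substantial accuracy gains on 29 datasets from the OpenML AutoML Benchmark and outperforms continued pre-training on broader, potentially noisier corpora such as CommonCrawl or GitTables.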

Contribution

The paper introduces Real-TabPFN, a TabPFN model improved through a targeted continued pre-training phase on a small, curated collection of large real-world datasets.

Findings

01

Substantial accuracy gains on 29 datasets from the OpenML AutoML Benchmark.

02

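Continued pre-training on a small, curated collection of large real-world datasets yields better downstream accuracy than continued pre-training on broader, potentially noisier corpora such as CommonCrawl or GitTables.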

03

Real-TabPFN sets new state-of-the-art performance for tabular models on benchmark datasets.

Abstract

Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
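
To make the idea of continued pre-training on real tables more concrete, here is a minimal sketch of how training episodes might be drawn from a small collection of real datasets. It assumes the usual in-context setup for prior-fitted networks, where each step shows the model a labelled context set and asks it to predict held-out query labels. The function names `sample_episode` and `episode_stream`, the row cap, and the context fraction are hypothetical and not taken from the paper; the model update itself is omitted.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (features, label)


def sample_episode(
    table: List[Row],
    max_rows: int,
    context_fraction: float,
    rng: random.Random,
) -> Tuple[List[Row], List[Row]]:
    """Draw one in-context episode from a single real table.

    Subsamples up to `max_rows` rows without replacement, then splits them
    into a context set (labels visible to the model) and a query set
    (labels hidden and used as prediction targets).
    """
    if len(table) < 2 or max_rows < 2:
        raise ValueError("need at least two rows to form context and query sets")
    n = min(max_rows, len(table))
    rows = rng.sample(table, n)
    # Keep at least one row on each side of the split.
    n_context = max(1, min(n - 1, int(n * context_fraction)))
    return rows[:n_context], rows[n_context:]


def episode_stream(
    tables: List[List[Row]],
    steps: int,
    max_rows: int = 1000,          # assumed cap, not from the paper
    context_fraction: float = 0.8,  # assumed split, not from the paper
    seed: int = 0,
) -> Iterator[Tuple[List[Row], List[Row]]]:
    """Yield `steps` episodes, picking a table uniformly at random each time."""
    rng = random.Random(seed)
    for _ in range(steps):
        table = rng.choice(tables)
        yield sample_episode(table, max_rows, context_fraction, rng)
```

In a real setup, each yielded `(context, query)` pair would be fed to a model initialised from the synthetic-data checkpoint, and the loss on the query labels would drive the continued pre-training update. How the datasets are weighted and how large each episode is are design choices this sketch does not attempt to reproduce.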
