Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data
Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, Frank Hutter

TL;DR
Real-TabPFN enhances tabular foundation models by continued pre-training on curated real-world datasets, leading to significant accuracy improvements on benchmark datasets compared to models trained on synthetic or noisier data.
Contribution
The paper introduces Real-TabPFN, a method that improves tabular foundation models through targeted continued pre-training on curated real-world data.
Findings
Significant accuracy gains on 29 datasets from the OpenML AutoML Benchmark.
Continued pre-training on curated real-world data outperforms both synthetic-only pre-training and continued pre-training on broader, noisier corpora such as CommonCrawl or GitTables.
Real-TabPFN sets new state-of-the-art performance for tabular models on benchmark datasets.
Abstract
Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
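The abstract does not spell out the training procedure, so the following is only a minimal sketch of how one real-world table could be turned into a training episode for continued pre-training. It assumes the usual TabPFN-style setup, in which the model conditions on labeled context rows and is trained to predict the labels of held-out query rows. The function name `sample_episode`, its parameters, and the toy table are hypothetical and not taken from the paper.

```python
import random


def sample_episode(rows, labels, n_context, n_query, rng):
    """Split one table into a labeled context set and a held-out query set.

    Hypothetical helper: an assumption about how real tables might be fed
    to an in-context learner, not the authors' implementation.
    """
    n_rows = len(rows)
    if len(labels) != n_rows:
        raise ValueError("rows and labels must have the same length")
    if n_context + n_query > n_rows:
        raise ValueError("table has too few rows for the requested split")
    # Sample distinct row indices so context and query never overlap.
    idx = rng.sample(range(n_rows), n_context + n_query)
    ctx, qry = idx[:n_context], idx[n_context:]
    context = ([rows[i] for i in ctx], [labels[i] for i in ctx])
    query = ([rows[i] for i in qry], [labels[i] for i in qry])
    return context, query


# Toy stand-in for one curated real-world table (200 rows, 4 features).
rng = random.Random(0)
rows = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
labels = [int(r[0] > 0) for r in rows]

(ctx_X, ctx_y), (qry_X, qry_y) = sample_episode(rows, labels, 64, 16, rng)
```

In a continued pre-training loop, episodes like this would be drawn repeatedly from each curated table, with the loss computed only on the query labels. Under that assumption, the only change from synthetic pre-training is where the tables come from.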
