Generalization Can Emerge in Tabular Foundation Models From a Single Table
Junwei Ma, Nour Shaheen, Alex Labach, Amine Mhedhbi, Frank Hutter, Anthony L. Caterini, Valentin Thomas

TL;DR
This paper demonstrates that simple self-supervised pre-training on a single real table can enable tabular models to generalize effectively across diverse datasets, challenging the belief that large-scale pre-training is necessary.
Contribution
It shows that effective tabular model generalization can be achieved with minimal data and highlights the importance of task diversity in pre-training.
Findings
Pre-training on one table can transfer well across benchmarks.
Task diversity during pre-training enhances model generalization.
Data quality and task construction are crucial for performance.
Abstract
Deep tabular modelling increasingly relies on in-context learning where, during inference, a model receives a set of pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization here requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), discovering that a relatively small amount of data suffices for generalization. We find that simple self-supervised pre-training on just a \emph{single} real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze what aspects of the data are most important for building a Tabular Foundation Model (TFM) generalizing across domains. We then connect this to the pre-training procedure…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
