Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models
Alex O. Davies, Telmo de Menezes e Silva Filho, Nirav Ajmeri

TL;DR
This paper compares the distributional characteristics of the real and synthetic tabular data used to pre-train tabular foundation models. It finds that synthetic prior data is narrowly distributed compared to real data, and that this distributional gap has limited effect on downstream model performance.
Contribution
It provides a detailed distributional analysis of different pre-training corpora for tabular models and assesses the impact of data mismatch on downstream performance.
Findings
Synthetic prior data is narrowly distributed compared to real data.
Optimizing synthetic data hyper-parameters does not close the distributional gap.
Distributional differences have limited effect on model generalization.
Abstract
Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, or how this affects downstream performance. In this work we take three archetypal datasets used to train tabular foundation models: the T4 dataset represents web-scraped corpora, the TabFM dataset represents curated tables from Kaggle, and the TabICL dataset represents synthetic priors, being the only well-used one with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that…
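The two comparison metrics named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names are hypothetical, the discriminator is a plain logistic regression (the paper may use a different classifier), coverage follows one common k-NN definition (the fraction of real points with at least one synthetic point inside their k-th-nearest-neighbour ball), and the "corpora" are random Gaussian feature vectors standing in for per-table aggregate features.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC via the Mann-Whitney U statistic; tied scores share an average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

def discriminator_auc(real, synth, seed=0, steps=500, lr=0.1):
    """Held-out AUC of a logistic-regression discriminator (assumed classifier).

    An AUC near 0.5 means the two corpora are hard to tell apart;
    an AUC near 1.0 means they are easily separated.
    """
    rng = np.random.default_rng(seed)
    x = np.vstack([real, synth])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])
    perm = rng.permutation(len(x))
    x, y = x[perm], y[perm]
    split = len(x) // 2  # first half trains, second half evaluates
    mean, std = x[:split].mean(axis=0), x[:split].std(axis=0) + 1e-9
    x = (x - mean) / std
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):  # plain batch gradient descent on the log-loss
        p = 1.0 / (1.0 + np.exp(-(x[:split] @ w + b)))
        grad = p - y[:split]
        w -= lr * x[:split].T @ grad / split
        b -= lr * grad.mean()
    return auc_from_scores(x[split:] @ w + b, y[split:])

def knn_coverage(real, synth, k=5):
    """Fraction of real points whose k-NN ball contains a synthetic point."""
    d_rr = np.linalg.norm(real[:, None] - real[None], axis=-1)
    radii = np.sort(d_rr, axis=1)[:, k]  # column 0 is the point itself
    d_rs = np.linalg.norm(real[:, None] - synth[None], axis=-1)
    return float((d_rs.min(axis=1) <= radii).mean())

# Toy stand-ins for per-table feature vectors (not real corpus statistics).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 3))
matched = rng.normal(0.0, 1.0, size=(200, 3))  # same distribution as `real`
narrow = rng.normal(2.0, 0.3, size=(200, 3))   # shifted, narrow "prior"

for name, synth in [("matched", matched), ("narrow", narrow)]:
    print(name, discriminator_auc(real, synth), knn_coverage(real, synth))
```

In this toy setting the narrow, shifted sample should be easy for the discriminator to separate from the real sample and should cover few real points, while the matched sample should do the opposite. Note that a linear discriminator cannot detect a difference in spread alone, which is why the toy "prior" is also shifted; a non-linear classifier would be needed for the general case.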
