Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality

Mohamed Bouadi; Nassim Bouarour; Varun Kulkarni; Shivam Dubey; Aditya Tanna; Vinay Kumar Sankarapu

arXiv:2605.18971 · cs.LG · May 20, 2026

TL;DR

This paper introduces O'Prior, a compositional realism prior for synthetic task distributions, significantly improving tabular foundation model accuracy and robustness by capturing real-world irregularities.

Contribution

The paper presents O'Prior, a novel synthetic prior design framework that enhances tabular model performance by incorporating diverse, realistic, and stress-aware synthetic distributions.

Findings

01. O'Prior improves downstream accuracy and robustness across benchmarks.

02. Diversity, realism, and stress modules each independently contribute to performance.

03. Synthetic prior design is a key determinant of tabular model quality.

Abstract

What determines the quality of a tabular foundation model? Unlike language or vision, tabular foundation models acquire their inductive biases almost entirely from synthetic pretraining distributions, yet the design of these distributions remains poorly understood. Standard synthetic priors are too well-behaved: they omit the irregularities and failure modes that determine deployment robustness. We introduce O'Prior, a compositional realism prior built around four coupled components: a hierarchical SCM meta-generator spanning diverse functional families; a modular realism engine covering heterogeneous marginals, missingness, and target transforms; an explicit stress module injecting confounding and support-query mismatch; and a curriculum-governed, leakage-safe generation protocol. To isolate prior design as the scientific variable, we hold architecture, optimizer, and compute budget…
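To make the idea of a synthetic prior concrete, the following is a minimal toy sketch of how one pretraining task might be drawn from a random structural causal model and then roughened up. It is not the O'Prior implementation: the function name `sample_task`, the four functional families, the MCAR missingness, the log-normal marginal, and the sort-based support/query split are all illustrative assumptions, and the curriculum component mentioned in the abstract is omitted entirely.

```python
import numpy as np


def sample_task(n_rows=200, n_features=5, missing_rate=0.1, seed=0):
    """Hypothetical toy generator: one synthetic tabular task from a random SCM."""
    rng = np.random.default_rng(seed)

    # Assumed functional families for the structural equations.
    families = [lambda z: z, np.tanh, np.sin, lambda z: np.maximum(z, 0.0)]

    # Stress (assumed form): a hidden confounder that drives every node
    # but is never exposed as a feature.
    u = rng.normal(size=n_rows)

    n_nodes = n_features + 1  # the last node plays the role of the target
    # Strictly upper-triangular weights: node j depends only on nodes i < j,
    # so the graph is acyclic by construction. Roughly half the edges are dropped.
    weights = np.triu(rng.normal(size=(n_nodes, n_nodes)), k=1)
    weights = weights * (rng.random((n_nodes, n_nodes)) < 0.5)

    nodes = np.zeros((n_rows, n_nodes))
    for j in range(n_nodes):
        f = families[rng.integers(len(families))]
        parents = nodes[:, :j] @ weights[:j, j]
        noise = rng.normal(scale=0.5, size=n_rows)
        nodes[:, j] = f(parents + rng.normal() * u) + noise

    X = nodes[:, :n_features].copy()
    latent_y = nodes[:, -1]

    # Realism (assumed form): one skewed marginal, a binarized target,
    # and values missing completely at random.
    X[:, 0] = np.exp(X[:, 0])
    y = (latent_y > np.median(latent_y)).astype(int)
    X[rng.random(X.shape) < missing_rate] = np.nan

    # Stress (assumed form): support/query mismatch. Rows are ordered by the
    # clean value of feature 1, so the query set covers a different region
    # of feature space. The two index sets are disjoint, so no row leaks.
    order = np.argsort(nodes[:, 1])
    half = n_rows // 2
    support, query = order[:half], order[half:]

    return {
        "X_support": X[support], "y_support": y[support],
        "X_query": X[query], "y_query": y[query],
    }


task = sample_task(seed=0)
print(task["X_support"].shape, task["X_query"].shape)
print("fraction missing:", np.isnan(task["X_support"]).mean())
```

Each call with a different seed yields a different graph, a different mix of functional forms, and a different missingness pattern. In a sketch like this, the prior is simply the distribution over such tasks, which is the quantity the paper varies while holding architecture, optimizer, and compute budget fixed.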
