Universal priors: solving empirical Bayes via Bayesian inference and pretraining
Nick Cannella, Anzo Teh, Yanjun Han, Yury Polyanskiy

TL;DR
This paper provides a theoretical foundation for why transformers pretrained on synthetic data can solve empirical Bayes problems across diverse test distributions, highlighting the role of universal priors and posterior contraction.
Contribution
It introduces the concept of universal priors for pretrained Bayes estimators, demonstrating near-optimal regret bounds and explaining length generalization through Bayesian inference.
Findings
Pretrained transformers can adapt to arbitrary test distributions via universal priors.
Training under these priors achieves a near-optimal regret bound of O(1/n), up to logarithmic factors, uniformly over all test distributions.
Posterior contraction explains the model's ability to generalize to longer sequences.
Abstract
We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify the existence of universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(1/n)$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This…
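To make the posterior-contraction mechanism concrete, here is a minimal sketch of a hierarchical Bayes estimator for a Poisson EB problem. It assumes a toy setting that is not taken from the paper: the unknown prior is one of finitely many discrete candidates, with a uniform hyperprior standing in for a universal prior. All function names, candidate priors, and sample sizes are hypothetical choices for illustration. The estimate of each theta_i is the average of the per-candidate posterior means, weighted by the posterior over candidates given the whole sample.

```python
import numpy as np
from math import lgamma

# log(k!) applied elementwise; avoids overflow for larger counts
_log_factorial = np.vectorize(lambda k: lgamma(k + 1.0))

def poisson_pmf(x, atoms):
    """Matrix of Poisson pmf values with shape (len(x), len(atoms))."""
    x = np.asarray(x, dtype=float)[:, None]
    atoms = np.asarray(atoms, dtype=float)[None, :]
    return np.exp(x * np.log(atoms) - atoms - _log_factorial(x))

def bayes_estimator(x, atoms, weights):
    """Posterior mean E[theta | X = x] under a discrete prior on `atoms`."""
    atoms = np.asarray(atoms, dtype=float)
    joint = poisson_pmf(x, atoms) * np.asarray(weights, dtype=float)[None, :]
    return (joint @ atoms) / joint.sum(axis=1)

def hierarchical_bayes(x, candidates, hyper_weights):
    """E[theta_i | X_1..X_n] when the prior is one of finitely many candidates.

    candidates: list of (atoms, weights) pairs (assumed toy hypothesis class).
    hyper_weights: prior probability of each candidate.
    Returns (estimates, posterior over candidates).
    """
    x = np.asarray(x)
    log_post = np.log(np.asarray(hyper_weights, dtype=float))
    for k, (atoms, weights) in enumerate(candidates):
        # marginal pmf of each observation under candidate k
        marginal = poisson_pmf(x, atoms) @ np.asarray(weights, dtype=float)
        log_post[k] += np.log(marginal).sum()
    post = np.exp(log_post - log_post.max())  # stabilised normalisation
    post /= post.sum()
    per_prior = np.stack([bayes_estimator(x, a, w) for a, w in candidates])
    return post @ per_prior, post

rng = np.random.default_rng(0)
candidates = [
    (np.array([1.0, 5.0]), np.array([0.5, 0.5])),  # used as the test prior below
    (np.array([2.0, 8.0]), np.array([0.7, 0.3])),
    (np.array([3.0]), np.array([1.0])),
]
hyper = np.full(len(candidates), 1.0 / len(candidates))
true_atoms, true_weights = candidates[0]

for n in (5, 50, 500):
    theta = rng.choice(true_atoms, size=n, p=true_weights)
    x = rng.poisson(theta)
    est, post = hierarchical_bayes(x, candidates, hyper)
    oracle = bayes_estimator(x, true_atoms, true_weights)
    # gap to the oracle Bayes rule shrinks as the posterior concentrates
    print(n, post.round(3), np.mean((est - oracle) ** 2))
```

As n grows, the posterior over candidates should concentrate on the one that generated the data, so the hierarchical estimate approaches the oracle Bayes estimator. This toy example only illustrates the mechanism; it says nothing about the rates or the construction of universal priors established in the paper.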
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Gaussian Processes and Bayesian Inference · Stochastic Gradient Optimization Techniques
