Why you don't overfit, and don't need Bayes if you only train for one epoch
Laurence Aitchison

TL;DR
In the data-rich, single-epoch setting, standard maximum likelihood training optimizes the true data generating process (DGP) loss, which is equivalent to the test loss, so Bayesian methods are not expected to offer advantages for overfitting or calibration.
Contribution
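The paper shows that in single-epoch training, maximum likelihood and the Bayesian model average optimize the same objective, with the Bayesian model average additionally taking an expectation over the uncertainty induced by finite data. On that basis it argues that Bayesian inference is not expected to offer advantages in such settings.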
Findings
Maximum likelihood training for a single epoch optimizes the true data generating process (DGP) loss, which is equivalent to the test loss.
Bayesian model averaging and maximum likelihood optimize the same objective in this setting.
Bayesian inference is not expected to offer advantages for overfitting or calibration in single-epoch training.
Abstract
Here, we show that in the data-rich setting where you only train on each datapoint once (or equivalently, you only train for one epoch), standard "maximum likelihood" training optimizes the true data generating process (DGP) loss, which is equivalent to the test loss. Further, we show that the Bayesian model average optimizes the same objective, albeit while taking the expectation over uncertainty induced by finite data. As standard maximum likelihood training in the single-epoch setting optimizes the same objective as Bayesian inference, we argue that we do not expect Bayesian inference to offer any advantages in terms of overfitting or calibration in these settings. This explains the diminishing importance of Bayes in areas such as LLMs, which are often trained with one (or very few) epochs.
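The single-epoch argument in the abstract can be illustrated with a toy sketch. It is not taken from the paper: the Gaussian DGP, the constant `MU`, and the helpers `loss`, `dgp_loss`, `single_epoch`, and `multi_epoch` are all assumptions chosen for illustration. The model estimates the mean of a unit-variance Gaussian, so the true DGP loss has a closed form and can be compared directly with the loss seen during training. In the single-epoch run each datapoint is a fresh draw evaluated before the parameter has seen it, so the average training loss should track the DGP loss. In the multi-epoch run the same five points are reused, so the training loss is minimized on those points specifically and is no longer an unbiased estimate of the DGP loss.

```python
import random

MU = 2.0  # assumed true mean of a toy Gaussian DGP with unit variance


def loss(x, theta):
    # Negative log-likelihood of N(theta, 1), up to an additive constant.
    return 0.5 * (x - theta) ** 2


def dgp_loss(theta):
    # Closed-form expected loss under the DGP: E[0.5 (x - theta)^2], x ~ N(MU, 1).
    return 0.5 * (1.0 + (theta - MU) ** 2)


def single_epoch(steps, rng):
    """SGD where every datapoint is a fresh draw and is used exactly once."""
    theta, online, true = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        x = rng.gauss(MU, 1.0)
        online += loss(x, theta)          # loss on x before theta has seen it
        true += dgp_loss(theta)           # exact DGP (test) loss at the same theta
        theta -= (1.0 / t) * (theta - x)  # gradient step with a 1/t step size
    return theta, online / steps, true / steps


def multi_epoch(data, epochs, lr=0.5):
    """Full-batch gradient descent that keeps reusing one small dataset."""
    theta = 0.0
    for _ in range(epochs):
        grad = sum(theta - x for x in data) / len(data)
        theta -= lr * grad
    train = sum(loss(x, theta) for x in data) / len(data)
    return theta, train


rng = random.Random(0)
theta_one, online_loss, true_loss = single_epoch(20000, rng)
data = [rng.gauss(MU, 1.0) for _ in range(5)]
theta_many, train_loss = multi_epoch(data, epochs=100)

print("single epoch: avg training loss", online_loss, "avg DGP loss", true_loss)
print("multi epoch:  training loss", train_loss, "DGP loss", dgp_loss(theta_many))
```

In the multi-epoch run the parameter converges to the sample mean of the five reused points, which can never have a higher training loss than the true mean `MU` and can never have a lower DGP loss, so the gap between the two can only widen.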
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies
