Learning from Synthetic Data: Limitations of ERM
Kareem Amin, Alex Bie, Weiwei Kong, Umar Syed, Sergei Vassilvitskii

TL;DR
This paper investigates the limitations of empirical risk minimization (ERM) when learning from a mixture of natural and synthetic data, revealing scenarios where ERM fails and proposing alternative algorithms that succeed.
Contribution
It demonstrates ERM's shortcomings in contaminated data settings and introduces algorithms capable of correctly learning despite arbitrary contamination levels.
Findings
For mean estimation, ERM converges to the true mean but is outperformed by an algorithm that assigns non-uniform weights to examples from different generations of data.
ERM may not converge to the true concept in PAC learning with contaminated data.
Algorithms exist that can learn the correct hypothesis for arbitrary VC classes despite contamination.
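To make the mean-estimation finding concrete, here is a minimal Python sketch of the two estimator families being compared. Under squared loss, ERM reduces to the plain sample mean of the pooled data, which is the special case of a per-generation weighted mean with equal weights. The contamination process in `simulate`, the function names, and the parameters `alpha` and `n` are all assumptions made for illustration; this summary does not give the paper's exact model or its weighting scheme, so the sketch does not claim which weights perform better.

```python
import numpy as np

def weighted_mean(batches, weights):
    """Mean estimate with one weight per generation of data.

    batches: list of arrays of shape (n_t, d), one per generation.
    weights: one non-negative weight per generation, shared by every
             example in that generation (placeholder values, not the
             paper's scheme).
    """
    weights = np.asarray(weights, dtype=float)
    sizes = np.array([len(b) for b in batches], dtype=float)
    sums = np.stack([b.sum(axis=0) for b in batches])  # shape (T, d)
    return (weights[:, None] * sums).sum(axis=0) / (weights * sizes).sum()

def erm_mean(batches):
    # ERM under squared loss: the sample mean of all pooled examples,
    # i.e. the weighted estimator with equal weight on every generation.
    return weighted_mean(batches, np.ones(len(batches)))

def simulate(mu, n, generations, alpha, rng):
    """Toy contamination process (an assumption, not the paper's model).

    Each generation contributes n examples. After the first generation,
    a fraction alpha of them is synthetic, drawn around the previous
    ERM estimate instead of the true mean mu.
    """
    d = len(mu)
    batches = []
    estimate = None
    for _ in range(generations):
        n_syn = 0 if estimate is None else int(alpha * n)
        natural = rng.normal(mu, 1.0, size=(n - n_syn, d))
        if n_syn:
            synthetic = rng.normal(estimate, 1.0, size=(n_syn, d))
        else:
            synthetic = np.empty((0, d))
        batches.append(np.vstack([natural, synthetic]))
        estimate = erm_mean(batches)
    return batches

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
batches = simulate(mu, n=50, generations=4, alpha=0.5, rng=rng)
uniform = erm_mean(batches)
first_only = weighted_mean(batches, [1, 0, 0, 0])  # trusts only generation 0
```

Passing different weight vectors to `weighted_mean` is the hook for experimenting with non-uniform schemes; the learner never sees which individual examples are synthetic, only which generation they arrived in.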
Abstract
The prevalence and low cost of LLMs have led to a rise of synthetic content. From review sites to court documents, "natural" content has been contaminated by data points that appear similar to natural data, but are in fact LLM-generated. In this work we revisit fundamental learning theory questions in this, now ubiquitous, setting. We model this scenario as a sequence of learning tasks where the input is a mix of natural and synthetic data, and the learning algorithms are oblivious to the origin of any individual example. We study the possibilities and limitations of ERM in this setting. For the problem of estimating the mean of an arbitrary d-dimensional distribution, we find that while ERM converges to the true mean, it is outperformed by an algorithm that assigns non-uniform weights to examples from different generations of data. For the PAC learning setting, the disparity is…
Taxonomy
Topics: Machine Learning and Algorithms · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Law
