Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training
Kevin Wang, Hongqian Niu, Didong Li

TL;DR
This paper provides the first theoretical guarantees for recursive training of generative AI models on contaminated data, showing convergence under minimal assumptions and addressing real-world complexities.
Contribution
The paper introduces a general framework for analyzing recursive training under minimal assumptions on the data, and proves convergence despite data contamination and sampling bias.
Findings
Recursive training converges at a rate determined by data quality and model baseline.
Theoretical guarantees hold for complex, real-world data distributions.
Empirical studies support the theoretical results.
Abstract
Generative Artificial Intelligence (AI), such as large language models (LLMs), has become a transformative force across science, industry, and society. As these systems grow in popularity, web data becomes increasingly interwoven with AI-generated material, which is increasingly difficult to separate from naturally generated content. Because generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions, creating a recursive training process with data contamination. Existing theoretical work has examined only highly simplified settings in which both the real data and the generative model are discrete or Gaussian, and has shown that recursive training leads to model collapse in those settings. However, real data distributions are far more complex, and modern generative models are far more…
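The recursive process described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's framework. It uses the simplified Gaussian setting that the abstract attributes to earlier work, and every name in it (`recursive_gaussian_fit`, `real_fraction`, the standard normal "real" distribution) is a hypothetical choice made for illustration. Each generation fits a Gaussian to a mixture of fresh real samples and samples drawn from the previous generation's fit.

```python
import random
import statistics


def recursive_gaussian_fit(real_fraction, n_samples=1000, n_generations=20, seed=0):
    """Toy contaminated recursive training with a Gaussian 'model'.

    real_fraction is the (assumed) share of real data in each generation's
    training set; the remainder is sampled from the previous generation's fit.
    Returns a list of (mean, std) estimates, one per generation.
    """
    rng = random.Random(seed)
    true_mu, true_sigma = 0.0, 1.0      # hypothetical real data distribution
    mu, sigma = true_mu, true_sigma     # generation 0 starts at the truth
    n_real = round(real_fraction * n_samples)
    history = []
    for _ in range(n_generations):
        # Fresh human-generated data.
        real = [rng.gauss(true_mu, true_sigma) for _ in range(n_real)]
        # AI-generated data from the previous model version.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        data = real + synthetic
        # "Training" here is just a maximum-likelihood Gaussian fit.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.5):
        final_mu, final_sigma = recursive_gaussian_fit(frac)[-1]
        print(f"real_fraction={frac}: mean={final_mu:.3f}, std={final_sigma:.3f}")
```

Varying `real_fraction` and `n_generations` gives a rough feel for how the share of real data affects drift of the fitted parameters across generations. It says nothing about the rates or guarantees proved in the paper.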
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods
