Language Generation with Infinite Contamination
Anay Mehrotra, Grigoris Velegkas, Xifan Yu, Felix Zhou

TL;DR
This paper investigates the limits of language generation algorithms under various contamination scenarios, establishing conditions for successful generation despite noise and omissions, and highlighting the importance of curriculum learning.
Contribution
It characterizes the robustness of language generation in the limit under contamination, extending previous results to noisy and omitted data, and introduces a curriculum learning-inspired model.
Findings
Generation is achievable if the contamination fraction converges to zero.
Dense generation is less robust to contamination than non-dense generation.
Generation with only membership oracle access is possible with finitely many contaminated examples.
Abstract
We study language generation in the limit, where an algorithm observes an adversarial enumeration of strings from an unknown target language and must eventually generate new, unseen strings from that language. Kleinberg and Mullainathan [KM24] proved that generation is achievable in surprisingly general settings. But their generator suffers from "mode collapse," producing from an ever-smaller subset of the target. To address this, Kleinberg and Wei [KW25] require the generator's output to be "dense" in the target language. They showed that generation with density, surprisingly, remains achievable at the same generality. Both results assume perfect data: no noisy insertions and no omissions. This raises a central question: how much contamination can generation tolerate? Recent works made partial progress on this question by studying (non-dense) generation with either finite amounts of…
Taxonomy
Topics: Machine Learning and Algorithms · Natural Language Processing Techniques · Machine Learning and Data Classification
