Model non-collapse: Minimax bounds for recursive discrete distribution estimation
Millen Kanabar, Michael Gastpar

TL;DR
This paper investigates the theoretical limits of discrete distribution estimation in a self-consuming, non-i.i.d. setting, deriving bounds and identifying conditions under which model collapse occurs or can be avoided.
Contribution
The paper derives minimax bounds for recursive distribution estimation in self-consuming loops, shows where these diverge from the bounds for oracle-assisted estimators, and analyzes different data regimes.
Findings
Ratios between minimax losses can grow without bound as the batch size increases.
Order-optimal bounds are established for the data accumulation setting.
Conditions are identified under which convergence rates differ from those of oracle-assisted estimators.
Abstract
Learning discrete distributions from i.i.d. samples is a well-understood problem. However, advances in generative machine learning prompt an interesting new, non-i.i.d. setting: after receiving a certain number of samples, an estimated distribution is fixed, and samples from this estimate are drawn and introduced into the sample corpus, undifferentiated from real samples. Subsequent generations of estimators now face contaminated environments, a scenario referred to in the machine learning literature as self-consumption. Empirically, it has been observed that models in fully synthetic self-consuming loops collapse -- their performance deteriorates with each batch of training -- but accumulating data has been shown to prevent complete degeneration. This, in turn, begs the question: What happens when fresh real samples are added at every stage? In this paper, we study the minimax…
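The self-consuming loop described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's method. It uses the plain empirical-frequency estimator rather than a minimax-optimal one, total variation distance as the loss, and a fixed batch composition per stage. The names `self_consuming_loop`, `n_real`, and `n_synth` are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def empirical(counts, k):
    """Empirical-frequency estimate over symbols 0..k-1 (an assumed estimator)."""
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(k)]


def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def self_consuming_loop(p, n_real, n_synth, stages, seed=0):
    """Simulate data accumulation with fresh real samples at every stage.

    At each stage, n_real samples from the true distribution p and n_synth
    samples from the previous estimate are added to one shared corpus,
    with no label saying which samples are real and which are synthetic.
    """
    if n_real < 1:
        raise ValueError("n_real must be at least 1 so the corpus is never empty")
    rng = random.Random(seed)
    k = len(p)
    symbols = range(k)
    corpus = Counter()   # accumulated counts over all stages
    estimate = None      # no estimate exists before the first stage
    losses = []
    for _ in range(stages):
        # Fresh real samples drawn from the true distribution.
        corpus.update(rng.choices(symbols, weights=p, k=n_real))
        # Synthetic samples drawn from the previous stage's fixed estimate.
        if estimate is not None and n_synth > 0:
            corpus.update(rng.choices(symbols, weights=estimate, k=n_synth))
        # Re-estimate from the whole contaminated corpus.
        estimate = empirical(corpus, k)
        losses.append(total_variation(p, estimate))
    return estimate, losses
```

Calling `self_consuming_loop([0.5, 0.3, 0.2], n_real=100, n_synth=100, stages=10)` returns the final estimate and the per-stage loss. Comparing runs with `n_synth=0` against runs with `n_synth > 0` gives an informal feel for the effect of contamination, though a single simulation of this kind says nothing about the minimax rates the paper studies.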
