SIGMA: Scalable Spectral Insights for LLM Model Collapse
Yi Gu, Lingyou Pang, Xiangkun Ye, Tianyu Wang, Jianyu Lin, Carey E. Priebe, Alexander Aue

TL;DR
This paper introduces SIGMA, a spectral analysis framework that quantifies and predicts model collapse in large language models during recursive training, providing both theoretical insights and scalable monitoring tools.
Contribution
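SIGMA offers a spectral method, built on deterministic and stochastic bounds on the spectrum of the embedding Gram matrix, to detect and analyze model collapse. The stochastic formulation makes it scalable to large models and recursive training scenarios.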
Findings
SIGMA accurately detects the onset of model collapse.
The spectral bounds correlate with representational contraction.
Scalable estimation enables monitoring large models in practice.
Abstract
The rapid adoption of synthetic data for training Large Language Models (LLMs) has introduced the technical challenge of "model collapse": a degenerative process in which recursive training on model-generated content leads to a contraction of distributional variance and representational quality. While the phenomenology of collapse is increasingly evident, rigorous methods to quantify and predict its onset in high-dimensional spaces remain elusive. In this paper, we introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework that benchmarks model collapse through the spectral lens of the embedding Gram matrix. By deriving and utilizing deterministic and stochastic bounds on the matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. Crucially, our stochastic formulation enables scalable estimation of…
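The idea of tracking contraction through the Gram spectrum can be illustrated with a minimal sketch. This is not SIGMA's method: the paper's deterministic and stochastic bounds are not reproduced in this summary, so the sketch substitutes a generic spectral summary (effective rank, the exponential of the spectral entropy) and a plain random-subsample average as a stand-in for scalable estimation. All function names, the toy data, and the choice of statistic are assumptions made for illustration.

```python
import numpy as np

def gram_spectrum(embeddings):
    """Eigenvalues (descending) of the Gram matrix G = X X^T / d for row embeddings X."""
    X = np.asarray(embeddings, dtype=float)
    G = X @ X.T / X.shape[1]
    eigvals = np.linalg.eigvalsh(G)  # symmetric solver, returns ascending order
    return np.clip(eigvals[::-1], 0.0, None)  # clip tiny negative round-off

def effective_rank(eigvals, eps=1e-12):
    """Illustrative statistic (assumption, not from the paper): exp of spectral entropy."""
    p = eigvals / (eigvals.sum() + eps)
    p = p[p > eps]  # drop numerically zero eigenvalues
    return float(np.exp(-(p * np.log(p)).sum()))

def subsampled_effective_rank(embeddings, m, trials, rng):
    """Average the statistic over random m-row subsets to avoid a full n x n Gram matrix."""
    X = np.asarray(embeddings, dtype=float)
    scores = []
    for _ in range(trials):
        idx = rng.choice(X.shape[0], size=m, replace=False)
        scores.append(effective_rank(gram_spectrum(X[idx])))
    return float(np.mean(scores))

# Toy data: a "healthy" isotropic cloud versus a "collapsed" one whose
# variance is concentrated in 4 of the 32 embedding directions.
rng = np.random.default_rng(0)
n, d = 200, 32
healthy = rng.normal(size=(n, d))
scales = np.concatenate([np.ones(4), 0.05 * np.ones(d - 4)])
collapsed = healthy * scales

er_healthy = effective_rank(gram_spectrum(healthy))
er_collapsed = effective_rank(gram_spectrum(collapsed))
sub_healthy = subsampled_effective_rank(healthy, m=64, trials=5, rng=rng)
sub_collapsed = subsampled_effective_rank(collapsed, m=64, trials=5, rng=rng)

print(f"full Gram:  healthy={er_healthy:.1f}  collapsed={er_collapsed:.1f}")
print(f"subsampled: healthy={sub_healthy:.1f}  collapsed={sub_collapsed:.1f}")
```

In a recursive-training setting, the same statistic would be computed on embeddings from each generation, with a sustained drop read as a sign of representational contraction. The subsampled variant trades some accuracy for cost, since each trial only needs an m × m Gram matrix.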
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Machine Learning in Materials Science
