Future of AI Models: A Computational perspective on Model collapse
Trivikram Satharasi (1), S Sitharama Iyengar (2) ((1) University of Florida, Gainesville, FL, (2) Florida International University, Miami, FL)

TL;DR
This paper investigates the risk of Model Collapse in AI systems caused by recursive training on synthetic data, using semantic similarity analysis of Wikipedia over time to predict when data diversity may significantly decline.
Contribution
It introduces a quantitative method to forecast the onset of Model Collapse by analyzing semantic similarity trends in large-scale textual data over multiple years.
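As a rough illustration of what a corpus-level semantic similarity score can look like, the sketch below averages cosine similarity over all distinct pairs of document embeddings. This is an assumption for illustration only: the summary does not say which embedding model or aggregation the authors use, and the function name `mean_pairwise_cosine` and the toy vectors are hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(embeddings) -> float:
    """Average cosine similarity over all distinct pairs of row vectors.

    `embeddings` is an (n_documents, dim) array. How the vectors are
    produced (which text-embedding model) is left open here; that choice
    is an assumption, not something taken from the paper.
    """
    X = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length so dot products equal cosines.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    # Keep only the strict upper triangle: each pair once, no self-pairs.
    rows, cols = np.triu_indices(X.shape[0], k=1)
    return float(sims[rows, cols].mean())

# Toy vectors standing in for document embeddings (not real data).
diverse = np.eye(3)                      # mutually orthogonal documents
uniform = np.array([[1.0, 0.0],
                    [1.0, 0.0]])         # two identical documents
```

Computed once per yearly snapshot of a corpus, a score of this kind would rise toward 1 as documents become more alike and stay near 0 when they remain semantically distinct.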
Findings
Semantic similarity increased exponentially after LLM adoption.
Early linguistic normalization efforts caused modest similarity rise.
Fluctuations reflect linguistic diversity and sampling errors.
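One way an exponential trend like the one reported above could be turned into a forecast is a log-linear least-squares fit followed by solving for the time at which the fitted curve reaches a chosen threshold. The sketch below is a minimal example under that assumption; the functions `fit_exponential` and `crossing_time`, the threshold, and the synthetic series are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_exponential(t, y):
    """Fit y ~ a * exp(b * t) by least squares on log(y).

    Returns (a, b). Requires every y to be strictly positive.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # polyfit with degree 1 returns [slope, intercept] of log(y) vs t.
    b, log_a = np.polyfit(t, np.log(y), 1)
    return float(np.exp(log_a)), float(b)

def crossing_time(a, b, threshold):
    """Time t at which a * exp(b * t) equals `threshold` (needs b != 0)."""
    return float(np.log(threshold / a) / b)

# Synthetic series for illustration only (not the paper's measurements):
# t is years since an arbitrary start, y is a made-up similarity score.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 0.1 * np.exp(0.2 * t)

a, b = fit_exponential(t, y)
t_hit = crossing_time(a, b, threshold=0.9)
```

Measuring time as an offset from a start year, rather than as a raw calendar year, keeps the fit numerically well behaved. Any forecast from such a fit is only as good as the assumption that the exponential trend continues unchanged.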
Abstract
Artificial Intelligence, especially Large Language Models (LLMs), has transformed domains such as software engineering, journalism, creative writing, academia, and media (Naveed et al. 2025; arXiv:2307.06435). Diffusion models like Stable Diffusion generate high-quality images and videos from text. Evidence shows rapid expansion: 74.2% of newly published webpages now contain AI-generated material (Ryan Law 2025), 30-40% of the active web corpus is synthetic (Spennemann 2025; arXiv:2504.08755), 52% of U.S. adults use LLMs for writing, coding, or research (Staff 2025), and audits find AI involvement in 18% of financial complaints and 24% of press releases (Liang et al. 2025). The underlying neural architectures, including Transformers (Vaswani et al. 2023; arXiv:1706.03762), RNNs, LSTMs, GANs, and diffusion networks, depend on large, diverse, human-authored datasets (Shi & Iyengar 2019).…
Taxonomy
Topics: Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education · Data Analysis with R
