Machine-generated text detection prevents language model collapse

George Drayson; Emine Yilmaz; Vasileios Lampos

arXiv:2502.15654 · cs.CL · September 23, 2025

TL;DR

This paper studies how decoding strategy affects the risk of model collapse in large language models, and proposes a detection and resampling method that prevents collapse and improves model performance.

Contribution

The paper introduces a machine-generated text detector and an importance resampling approach that together prevent model collapse and improve LLM training by integrating synthetic data effectively.
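As a rough illustration of how detector scores could drive importance resampling, here is a minimal sketch. It is not the authors' implementation: the function name `resample_by_detector`, the weighting rule (weight = 1 minus the detector's machine-generated probability), and sampling with replacement are all assumptions made for this example, since this summary does not state the exact weighting scheme.

```python
import random


def resample_by_detector(samples, p_machine, k, seed=0):
    """Draw k training samples with replacement, weighted by detector output.

    samples   : list of text samples of unknown origin (human or machine)
    p_machine : detector's estimated probability that each sample is
                machine-generated (hypothetical detector, values in [0, 1])
    k         : number of samples to draw for the next training set

    Assumption for this sketch: the importance weight of a sample is
    1 - p_machine, so text judged likely human is drawn more often.
    """
    if len(samples) != len(p_machine):
        raise ValueError("samples and p_machine must have the same length")

    weights = [1.0 - p for p in p_machine]
    if sum(weights) <= 0:
        raise ValueError("all samples have zero weight")

    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    return rng.choices(samples, weights=weights, k=k)


if __name__ == "__main__":
    docs = ["human a", "machine b", "human c"]
    scores = [0.0, 1.0, 0.1]  # made-up detector outputs for illustration
    resampled = resample_by_detector(docs, scores, k=10, seed=1)
    print(resampled)
```

In this sketch a sample the detector scores as certainly machine-generated gets weight zero and is never drawn, while uncertain samples are down-weighted rather than discarded, which is one way synthetic text could still contribute to training.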

Findings

01 The proposed method prevents model collapse across multiple LLMs.

02 Synthetic samples, when properly curated, can improve model performance.

03 Decoding strategy significantly impacts the likelihood of model collapse.

Abstract

As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since online data is the primary resource for LLM pre-training, subsequent models could be trained on an unknown portion of synthetic samples. This could lead to model collapse, a degenerative process whereby LLMs reinforce their own errors, reduce output diversity, and ultimately yield declining performance. In this study, we investigate the impact of decoding strategy on model collapse, analysing the text characteristics at each model generation, the similarity to human references, and the resulting model performance. Using the decoding strategies that lead to the most significant degradation, we evaluate model collapse in a more realistic scenario where the origin of the data (human or…

Code & Models

Repositories

georgedrayson/model_collapse (PyTorch, official)
