Retrieval Collapses When AI Pollutes the Web
Hongyeon Yu, Dongchan Kim, and Young-Bum Kim

TL;DR
This paper identifies a failure mode called Retrieval Collapse, where AI-generated content dominates search results, reducing source diversity and risking quality decline in retrieval systems, especially as AI content proliferates on the web.
Contribution
It introduces the concept of Retrieval Collapse, characterizes its ecosystem-level dynamics, and provides experimental evidence on how AI-generated content impacts retrieval quality and diversity.
Findings
High contamination leads to homogenized search results.
LLM-based rankers suppress harmful content better than BM25.
Synthetic evidence can deceptively maintain answer accuracy.
Abstract
The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process where (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyzed this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67% pool contamination led to over 80% exposure contamination, creating a homogenized yet deceptively healthy state where answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial…
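The distinction between pool contamination and exposure contamination can be illustrated with a minimal sketch. The definitions here are assumptions, since the excerpt above does not give the paper's exact formulas: pool contamination is taken as the share of AI-generated documents in the candidate pool, and exposure contamination as the mean share of AI-generated documents in the top-k results per query. The function names, document IDs, and toy rankings are hypothetical and are not the paper's data.

```python
from typing import Dict, List, Set


def pool_contamination(doc_ids: List[str], synthetic_ids: Set[str]) -> float:
    """Assumed definition: fraction of pool documents that are AI-generated."""
    if not doc_ids:
        return 0.0
    return sum(d in synthetic_ids for d in doc_ids) / len(doc_ids)


def exposure_contamination(
    rankings: Dict[str, List[str]], synthetic_ids: Set[str], k: int = 10
) -> float:
    """Assumed definition: mean fraction of AI-generated docs in each query's top-k."""
    per_query = []
    for ranked in rankings.values():
        top_k = ranked[:k]
        if top_k:  # skip queries with no results
            per_query.append(sum(d in synthetic_ids for d in top_k) / len(top_k))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data (hypothetical): two human-written and two synthetic documents.
pool = ["h1", "h2", "s1", "s2"]
synthetic = {"s1", "s2"}

# Hypothetical ranker output in which synthetic documents tend to rank first.
rankings = {
    "q1": ["s1", "s2", "h1", "h2"],
    "q2": ["s1", "h1", "s2", "h2"],
}

pool_rate = pool_contamination(pool, synthetic)
exposure_rate = exposure_contamination(rankings, synthetic, k=2)
print(f"pool contamination: {pool_rate:.2f}")
print(f"exposure contamination at k=2: {exposure_rate:.2f}")
```

In this toy setup the top-k exposure rate exceeds the pool rate because the synthetic documents rank ahead of the human-written ones, which is the kind of amplification the abstract describes for the SEO scenario.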
Taxonomy
Topics: Topic Modeling · Ethics and Social Impacts of AI · Information Retrieval and Search Behavior
