DeGenTWeb: A First Look at LLM-dominant Websites
Sichang Steven He, Calvin Ardi, Ramesh Govindan, Harsha V. Madhyastha

TL;DR
DeGenTWeb introduces a systematic method for identifying LLM-dominant websites, finds them highly prevalent and growing in share, and shows that detectors of LLM-generated text perform much worse than advertised when tuned to avoid misattributing human-authored content.
Contribution
This work develops a novel approach for site-level detection of LLM-generated content and provides the first large-scale analysis of LLM-dominant websites' prevalence.
Findings
LLM-dominant sites are highly prevalent in web data and search results.
The share of LLM-dominant sites is increasing over time.
Detectors of LLM-generated text perform much worse than advertised when tuned to minimize false attribution of human-authored content.
Abstract
Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb, which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb,…
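The abstract mentions aggregating page-level detection results into a site-level category but does not give the rule in the text available here. The sketch below is one plausible illustration of that idea, not the authors' method: it assumes a detector has already produced a score in [0, 1] for each page, flags pages above a strict per-page threshold (to limit false attribution of human text), and labels the site by the fraction of flagged pages. The function name `classify_site`, the `SiteVerdict` type, and every threshold value are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SiteVerdict:
    label: str               # "llm-dominant", "not-llm-dominant", or "undetermined"
    pages_scored: int        # number of page scores supplied
    flagged_fraction: float  # share of pages at or above the page threshold


def classify_site(
    page_scores: Sequence[float],
    page_threshold: float = 0.9,  # hypothetical: strict cut-off per page
    site_fraction: float = 0.5,   # hypothetical: share of flagged pages needed
    min_pages: int = 5,           # hypothetical: minimum evidence per site
) -> SiteVerdict:
    """Aggregate per-page detector scores into one site-level label.

    Illustrative only; the thresholds are assumptions, not values from the paper.
    """
    n = len(page_scores)
    if n < min_pages:
        # Too few pages to say anything reliable about the site.
        return SiteVerdict("undetermined", n, 0.0)

    flagged = sum(1 for score in page_scores if score >= page_threshold)
    fraction = flagged / n
    label = "llm-dominant" if fraction >= site_fraction else "not-llm-dominant"
    return SiteVerdict(label, n, fraction)


if __name__ == "__main__":
    # Example scores for five pages of one hypothetical site.
    print(classify_site([0.95, 0.97, 0.99, 0.20, 0.93]))
```

Requiring a minimum number of pages and a strict per-page threshold trades coverage for fewer false positives, which matches the abstract's stated concern about misattributing human-authored content.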
