Preprint: Poster: Did I Just Browse A Website Written by LLMs?
Sichang Steven He, Ramesh Govindan, Harsha V. Madhyastha

TL;DR
This paper introduces a scalable, highly accurate method for detecting websites generated by large language models, revealing their increasing prevalence and potential impact on the web ecosystem.
Contribution
It proposes a novel website-level classification pipeline that outperforms existing detectors on complex web content and provides the first large-scale analysis of LLM-dominant sites.
Findings
Achieved 100% accuracy on ground truth datasets.
Detected a significant portion of LLM-dominant sites in web archives.
Found increasing prevalence of LLM-generated content in search results.
Abstract
Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this "LLM-dominant" content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet, websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art (SoTA) LLM detectors are inaccurate on web content, which has low positive rates, complex markup, and diverse genres, unlike the clean, prose-like benchmark data these detectors are optimized for. We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector's outputs on multiple prose-like pages, which boosts accuracy. We train and…
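The site-level idea in the abstract (deciding per site from a text detector's outputs on several prose-like pages, rather than per page) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the abstract is truncated before it describes how page outputs are combined, so the fraction-of-flagged-pages rule, the function name `classify_site`, and the parameters `page_threshold`, `site_threshold`, and `min_pages` are hypothetical and are not the authors' method. The per-page scores are assumed to come from some external LLM text detector that returns a probability-like value in [0, 1].

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class SiteVerdict:
    n_pages: int             # number of prose-like pages that were scored
    mean_score: float        # average detector score across those pages
    flagged_fraction: float  # share of pages at or above page_threshold
    llm_dominant: bool       # site-level decision


def classify_site(
    page_scores: Sequence[float],
    page_threshold: float = 0.5,   # hypothetical per-page cutoff
    site_threshold: float = 0.5,   # hypothetical share of flagged pages
    min_pages: int = 3,            # hypothetical minimum evidence per site
) -> Optional[SiteVerdict]:
    """Aggregate per-page detector scores into one site-level verdict.

    Assumed rule (not taken from the paper): a site is labeled
    LLM-dominant when the fraction of pages whose score reaches
    page_threshold is at least site_threshold. Sites with fewer than
    min_pages scored pages are left undecided (None).
    """
    if len(page_scores) < min_pages:
        return None
    flagged = sum(1 for s in page_scores if s >= page_threshold)
    fraction = flagged / len(page_scores)
    return SiteVerdict(
        n_pages=len(page_scores),
        mean_score=mean(page_scores),
        flagged_fraction=fraction,
        llm_dominant=fraction >= site_threshold,
    )


if __name__ == "__main__":
    # Made-up scores for four pages of one hypothetical site.
    print(classify_site([0.9, 0.8, 0.2, 0.95]))
```

The point of aggregating is that a single noisy page score has less influence on the outcome once several pages from the same site are considered together; the specific thresholds above are placeholders, not tuned values.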
Taxonomy
Topics: Academic Publishing and Open Access · Authorship Attribution and Profiling · Academic integrity and plagiarism
