On the Possibilities of AI-Generated Text Detection
Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang

TL;DR
This paper investigates the theoretical and empirical limits of detecting AI-generated text, demonstrating that detection is generally feasible with sufficient sample size unless machine and human texts are indistinguishable, and providing bounds for detection complexity.
Contribution
It offers the first theoretical bounds on sample complexity for AI text detection and validates detection methods across multiple datasets and models.
Findings
Detection is possible with adequate sample size unless the human and machine text distributions are indistinguishable across their entire support.
Empirical results confirm the effectiveness of state-of-the-art detectors across various datasets.
Theoretical bounds align with the sequence lengths at which detection succeeds empirically.
Abstract
Our work addresses the critical issue of distinguishing text generated by Large Language Models (LLMs) from human-produced text, a task essential for numerous applications. Despite ongoing debate about the feasibility of such differentiation, we present evidence supporting its consistent achievability, except when human and machine text distributions are indistinguishable across their entire support. Drawing from information theory, we argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases. We establish precise sample complexity bounds for detecting AI-generated text, laying groundwork for future research aimed at developing advanced, multi-sample detectors. Our empirical evaluations across multiple datasets (Xsum, Squad, IMDb, and Kaggle FakeNews) confirm the viability of enhanced detection methods. We test various…
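The claim that closer distributions require more samples can be illustrated with a toy calculation. The sketch below is not the paper's bound or method. It assumes each text sample is reduced to a single binary feature that fires with probability `p_human` for human text and `p_machine` for machine text (both hypothetical parameters), and computes the total variation distance between the joint distributions of n independent samples. The function names are illustrative only.

```python
import math


def binom_pmf(n, k, p):
    """Binomial(n, p) probability of k successes, computed in log space (requires 0 < p < 1)."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))


def tv_n_samples(p_human, p_machine, n):
    """Total variation distance between n i.i.d. Bernoulli samples from each source.

    For i.i.d. Bernoulli draws the success count is a sufficient statistic,
    so the distance between the two product distributions equals the
    distance between the two binomial count distributions.
    """
    return 0.5 * sum(
        abs(binom_pmf(n, k, p_human) - binom_pmf(n, k, p_machine))
        for k in range(n + 1)
    )


def min_samples(p_human, p_machine, target_tv, max_n=2000):
    """Smallest n (up to max_n) whose n-sample distance reaches target_tv, else None."""
    for n in range(1, max_n + 1):
        if tv_n_samples(p_human, p_machine, n) >= target_tv:
            return n
    return None


if __name__ == "__main__":
    # Toy assumption: the feature fires with probability 0.5 for human text.
    for p_machine in (0.7, 0.6, 0.55):
        print(p_machine, min_samples(0.5, p_machine, target_tv=0.5))
```

In this toy setting the required n grows as `p_machine` approaches `p_human`, and no n suffices when the two are equal, which mirrors the qualitative statement in the abstract.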
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Layer Normalization · Softmax · Byte Pair Encoding
