PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks
Yiwei Zha, Rui Min, Shanu Sushmita

TL;DR
This paper introduces PADBen, a comprehensive benchmark for evaluating AI text detectors against paraphrase attacks, revealing current detectors' weaknesses and the need for new detection strategies.
Contribution
The paper presents PADBen, the first benchmark systematically assessing detector robustness against paraphrase attacks, including a detailed taxonomy and evaluation of 11 detectors.
Findings
Detectors succeed against plagiarism evasion but fail against authorship obfuscation.
Current detection methods cannot effectively handle intermediate laundering regions.
Fundamental advances are needed beyond existing semantic and stylistic discrimination techniques.
Abstract
While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively paraphrased content. We investigate why iteratively paraphrased text, itself AI-generated, evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns, which gives rise to two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark systematically evaluating detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content…
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Hate Speech and Cyberbullying Detection
