Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text
Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi

TL;DR
This paper introduces a training-free adversarial paraphrasing method that effectively evades AI-generated text detectors by humanizing content, demonstrating high transferability and exposing vulnerabilities in current detection systems.
Contribution
The authors propose a universal, training-free adversarial paraphrasing framework guided by an off-the-shelf LLM to bypass AI-generated text detectors, highlighting the need for more robust detection methods.
Findings
Significantly reduces detection rates across multiple systems
Achieves up to a 98.96% reduction in true positive rate at a 1% false positive rate (T@1%F)
Largely preserves text quality, with slight degradation
Abstract
The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across…
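The abstract does not spell out how the detector guides the paraphraser. One plausible reading is a decoding loop in which the paraphraser proposes several candidate next tokens and the detector picks the one that looks least AI-generated. The sketch below illustrates that idea only; it is not the authors' implementation. The names `adversarial_paraphrase`, `propose_next`, `detector_score`, and the toy word tables are all hypothetical stand-ins, with simple word lists in place of a real LLM and a real detector.

```python
from typing import Callable, List, Sequence

def adversarial_paraphrase(
    propose_next: Callable[[List[str]], Sequence[str]],
    detector_score: Callable[[str], float],
    max_tokens: int = 50,
    eos: str = "<eos>",
) -> str:
    """Greedy detector-guided decoding (illustrative sketch).

    propose_next: hypothetical stand-in for a paraphraser LLM; given the
        tokens produced so far, returns candidate next tokens.
    detector_score: hypothetical stand-in for an AI-text detector; returns
        a score where higher means "more likely AI-generated".
    """
    output: List[str] = []
    for _ in range(max_tokens):
        candidates = propose_next(output)
        if not candidates:
            break

        def score(tok: str) -> float:
            # Score the text that would result from choosing this token;
            # the end-of-sequence marker adds no visible text.
            tokens = output if tok == eos else output + [tok]
            return detector_score(" ".join(tokens))

        # Keep the candidate the detector rates as least AI-like
        # (ties go to the earliest candidate).
        best = min(candidates, key=score)
        if best == eos:
            break
        output.append(best)
    return " ".join(output)

# --- Toy stand-ins (assumptions, for demonstration only) ---
SLOTS = [
    ["furthermore", "also"],
    ["we", "researchers"],
    ["utilize", "use"],
    ["robust", "sturdy"],
    ["methods", "tools"],
]
AI_WORDS = {"furthermore", "utilize", "robust"}

def toy_propose(prefix: List[str]) -> Sequence[str]:
    # One fixed list of alternatives per position, then end of sequence.
    return SLOTS[len(prefix)] if len(prefix) < len(SLOTS) else ["<eos>"]

def toy_detector(text: str) -> float:
    # Fraction of words that appear in a made-up "AI-sounding" word list.
    words = text.split()
    return sum(w in AI_WORDS for w in words) / max(len(words), 1)

result = adversarial_paraphrase(toy_propose, toy_detector)
print(result)
```

In this toy setting, always taking the first candidate would produce "furthermore we utilize robust methods", which the toy detector scores at 0.6, while the guided loop steers away from every flagged word. A real attack would replace the two callables with an instruction-following LLM and an actual detector, and scoring every candidate at every step would dominate the cost.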
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
Methods: Sparse Evolutionary Training
