Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack
Ying Zhou, Ben He, Le Sun

TL;DR
This paper demonstrates that current AI-text detection models are vulnerable to adversarial attacks that can quickly and easily evade detection, highlighting the need for more robust detection methods.
Contribution
The paper introduces a framework for adversarial attacks on AI-text detectors and evaluates their effectiveness, revealing vulnerabilities and exploring robustness improvements.
Findings
Detection models can be bypassed in as little as 10 seconds.
Adversarial learning can improve robustness but faces practical challenges.
Current detectors are significantly vulnerable to minor perturbations.
Abstract
With the development of large language models (LLMs), detecting whether text is generated by a machine is becoming increasingly challenging, which complicates efforts to curb the spread of false information, protect intellectual property, and prevent academic plagiarism. While well-trained text detectors have demonstrated promising performance on unseen test data, recent research suggests that these detectors are vulnerable to adversarial attacks such as paraphrasing. In this paper, we propose a framework for a broader class of adversarial attacks, designed to make minor perturbations to machine-generated content so that it evades detection. We consider two attack settings, white-box and black-box, and employ adversarial learning in dynamic scenarios to assess whether it can improve the current detection model's robustness against such attacks. The…
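To make the idea of a minor-perturbation attack concrete, here is a minimal Python sketch of a greedy word-substitution loop in a black-box setting, meaning the attacker sees only the detector's output score. It is not the paper's method: the detector (`toy_detector`), the marker-word list, the substitution table (`SYNONYMS`), and the `threshold` and `max_edits` parameters are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in detector: returns a score in [0, 1], where higher
# means "more machine-like". It only counts a few marker words; a real
# detector would be a trained classifier.
MARKERS = {"furthermore", "moreover", "utilize", "delve"}


def toy_detector(text: str) -> float:
    words = text.lower().split()
    hits = sum(1 for w in words if w.strip(".,") in MARKERS)
    return min(1.0, hits / 4)


# Hypothetical substitution table (an assumption for this sketch).
SYNONYMS: Dict[str, List[str]] = {
    "furthermore": ["also"],
    "moreover": ["besides"],
    "utilize": ["use"],
    "delve": ["dig"],
}


def greedy_black_box_attack(
    text: str,
    score: Callable[[str], float],
    synonyms: Dict[str, List[str]],
    threshold: float = 0.5,
    max_edits: int = 3,
) -> str:
    """Swap one word at a time, keeping the swap that lowers the score most.

    Only the detector's output score is queried (black-box); no gradients
    or model internals are used.
    """
    words = text.split()
    edits = 0
    while edits < max_edits and score(" ".join(words)) >= threshold:
        current = score(" ".join(words))
        best: Optional[Tuple[float, int, str]] = None  # (score, index, word)
        for i, w in enumerate(words):
            core = w.strip(".,")  # keep surrounding punctuation intact
            for cand in synonyms.get(core.lower(), []):
                trial = words.copy()
                trial[i] = w.replace(core, cand)
                s = score(" ".join(trial))
                if s < current and (best is None or s < best[0]):
                    best = (s, i, trial[i])
        if best is None:  # no single substitution lowers the score
            break
        words[best[1]] = best[2]
        edits += 1
    return " ".join(words)
```

The `max_edits` budget is what keeps the perturbation minor: the loop stops once the score drops below the threshold or the budget is spent. A white-box variant would typically rank candidate words using the detector's gradients rather than repeated score queries.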
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques
