Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods

Andrii Shportko; Inessa Verbitsky

arXiv:2605.14240 · cs.LG · May 15, 2026

TL;DR

This paper evaluates the robustness of different AI-generated text detection methods against paraphrasing attacks, highlighting the trade-off between detection accuracy and resilience.

Contribution

It compares three detection approaches (fine-tuned RoBERTa, Binoculars, and text feature analysis) and their Random Forest ensembles, revealing the strengths and vulnerabilities of each under paraphrasing attacks.

Findings

01. Binoculars-inclusive ensembles perform best overall.

02. All methods experience significant performance drops under paraphrasing attacks.

03. Detection methods exhibit a clear trade-off between resilience and accuracy.

Abstract

The recent large-scale emergence of LLMs has left open the question of how to deal with their consequences, such as plagiarism and the spread of false information on the Internet. Coupled with the rise of tools for bypassing AI detectors, this puts reliable machine-generated text detection in increasingly high demand. We investigate the paraphrasing attack resilience of various machine-generated text detection methods, evaluating three approaches: fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their ensembles built using Random Forest classifiers. We find that Binoculars-inclusive ensembles yield the strongest results but also suffer the most significant losses under attack. In this paper, we present the dichotomy of performance versus resilience in AI text detection, which complicates the current perception of reliability among state-of-the-art techniques.
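
To make the performance-versus-resilience dichotomy concrete, the following is a minimal sketch of how a detector's loss under attack could be quantified: score the same labeled texts before and after paraphrasing and report the difference. The abstract does not state which metric the paper uses, so plain accuracy is an assumption here, and the names `accuracy`, `evaluate_resilience`, and `ResilienceReport` as well as the toy predictions are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ResilienceReport:
    """Accuracy of one detector before and after a paraphrasing attack."""
    clean_accuracy: float
    attacked_accuracy: float

    @property
    def absolute_drop(self) -> float:
        # Larger values mean the detector is less resilient to the attack.
        return self.clean_accuracy - self.attacked_accuracy


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(labels) == 0 or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def evaluate_resilience(
    labels: Sequence[int],
    clean_predictions: Sequence[int],
    attacked_predictions: Sequence[int],
) -> ResilienceReport:
    """Compare a detector on original texts and on their paraphrased versions.

    Assumption: label 1 means machine-generated and 0 means human-written,
    and both prediction lists refer to the same texts in the same order.
    """
    return ResilienceReport(
        clean_accuracy=accuracy(labels, clean_predictions),
        attacked_accuracy=accuracy(labels, attacked_predictions),
    )


if __name__ == "__main__":
    # Toy, made-up predictions for illustration only (not results from the paper).
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    clean = [1, 1, 1, 1, 0, 0, 0, 1]
    attacked = [1, 0, 0, 1, 0, 0, 0, 1]  # two paraphrased machine texts now evade detection
    report = evaluate_resilience(labels, clean, attacked)
    print(report.clean_accuracy, report.attacked_accuracy, report.absolute_drop)
```

Running the same comparison for each detector and each ensemble would place every method on two axes, clean accuracy and drop under attack, which is one way to read the trade-off the abstract describes.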
