Can AI-Generated Text be Reliably Detected?

Vinu Sankar Sadasivan; Aounon Kumar; Sriram Balasubramanian; Wenxiao Wang; Soheil Feizi

arXiv:2303.11156·cs.CL·January 20, 2025·149 cites

TL;DR

This paper evaluates the robustness of AI-generated text detectors against adversarial attacks, revealing vulnerabilities and providing a theoretical framework for understanding detection limits in the context of advancing language models.

Contribution

It introduces recursive paraphrasing attacks, assesses detector vulnerabilities, and connects detection performance to distributional differences between human and AI text.

Findings

01. Recursive paraphrasing reduces detection accuracy significantly.

02. Watermarked LLMs can be spoofed without white-box access.

03. Theoretical analysis links detector performance to distributional divergence.
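
The third finding can be made concrete with a small numeric sketch. The bound used here, AUROC ≤ 1/2 + TV − TV²/2, where TV is the total variation distance between the AI and human text distributions, comes from the full paper rather than from this summary. The function names and the toy two-token distributions are illustrative assumptions, not part of the authors' code.

```python
from typing import Dict


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions.

    Each distribution is a dict mapping an outcome to its probability.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def auroc_upper_bound(tv: float) -> float:
    """Upper bound on a detector's AUROC for a given total variation distance.

    Implements 1/2 + tv - tv^2 / 2. At tv = 0 the bound is 0.5 (chance);
    at tv = 1 it is 1.0 (perfect detection is possible).
    """
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv * tv / 2.0


# Toy distributions (hypothetical, for illustration only).
human = {"a": 0.5, "b": 0.5}
machine = {"a": 0.25, "b": 0.75}

tv = total_variation(human, machine)
bound = auroc_upper_bound(tv)
print(f"TV = {tv:.3f}, AUROC upper bound = {bound:.5f}")
```

As the two distributions move closer together, TV shrinks and the best achievable AUROC approaches 0.5, which is the sense in which detection becomes harder as language models improve.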

Abstract

Large Language Models (LLMs) perform impressively well in various applications. However, the potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use. Consequently, the reliable detection of AI-generated text has become a critical area of research. AI text detectors have been shown to be effective under their specific settings. In this paper, we stress-test the robustness of these AI text detectors in the presence of an attacker. We introduce a recursive paraphrasing attack to stress-test a wide range of detection schemes, including those using watermarking as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments, conducted on passages each approximately 300 tokens long, reveal the varying sensitivities of these detectors to our attacks.…
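
The recursive paraphrasing attack described in the abstract can be sketched as a loop that feeds each paraphrase back into the paraphraser. This is a minimal illustration, not the authors' implementation: `paraphrase` and `detector_score` are hypothetical callables standing in for a real paraphrasing model and a real detector, and the toy versions below exist only so the sketch runs.

```python
from typing import Callable, List, Optional


def recursive_paraphrase(
    text: str, paraphrase: Callable[[str], str], rounds: int
) -> List[str]:
    """Return [text, pp1, pp2, ...], where each entry paraphrases the previous one."""
    versions = [text]
    for _ in range(rounds):
        versions.append(paraphrase(versions[-1]))
    return versions


def first_undetected(
    versions: List[str],
    detector_score: Callable[[str], float],
    threshold: float,
) -> Optional[int]:
    """Index of the first version scoring below the threshold, or None if all are flagged."""
    for i, version in enumerate(versions):
        if detector_score(version) < threshold:
            return i
    return None


# Toy stand-ins (assumptions): the "paraphraser" appends a marker and the
# "detector" score decays with the number of markers.
def toy_paraphrase(s: str) -> str:
    return s + "!"


def toy_detector(s: str) -> float:
    return 1.0 / (1 + s.count("!"))


versions = recursive_paraphrase("x", toy_paraphrase, rounds=3)
evaded_at = first_undetected(versions, toy_detector, threshold=0.4)
print(versions, evaded_at)
```

In a real evaluation the score after each round would be recorded alongside a text-quality measure, since repeated paraphrasing can degrade the passage.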

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- The paper is well written and well structured, and the research tackles an increasingly important topic in the AI community.
- The paper's focus on recursive paraphrasing attacks represents an innovative and practical contribution to the field of AI security by showing that these attacks can effectively remove watermarks from AI text.
- The paper supports its practical experiments with theoretical proofs, providing a deep understanding of the problem space.

Weaknesses

The paper does not include testing on a diverse array of datasets like the M4 or Deepfake text detection, which encompasses Multi-generator, Multi-domain, and Multi-lingual data. Incorporating these datasets could provide a more comprehensive evaluation of the paraphrasing model's effectiveness across different text generation sources, domains, and languages.

Pu, J., Sarwar, Z., Abdullah, S. M., Rehman, A., Kim, Y., Bhattacharya, P., ... & Viswanath, B. (2023, May). Deepfake text detection: Lim

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

It is highly important to understand the limitations of current AI-generated text detection algorithms and to accurately identify the generated text to protect privacy and ethics.

Weaknesses

* The biggest concern is with the paraphrasing task. Should the paraphrased text still be called text generated by an LLM? Does paraphrasing not destroy the inherent characteristics of the generating model? Have the original texts also been paraphrased by the paraphrasers? What is the impact of paraphrasing genuine texts?
* Ablation study of different paraphrasers? What is the contribution here? Have the authors directly used the existing algorithms for paraphrasing? With the existence of several studies, the curre

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

* The authors address the important problem of reliable AI-generated text detection. This problem is likely to become increasingly important with the rapid rise in popularity of large language models in society.
* The proposed approach is simple yet effective in undermining the reliability of current AI-generated text detectors. The simplicity of the attack may also make it more likely to generalize to different models and domains.
* The authors perform a human evaluation using Amazon's Mechan

Weaknesses

* The experiments only use text passages from a single dataset, and tend to evaluate only a single model for each detector type. Evaluations on a wider range of datasets and detectors would greatly strengthen any generalizable claims regarding the proposed paraphrasing attack.
* Lack of baseline methods. In many of the experiments, the proposed attack is the only method being evaluated. Have the authors compared their attack with similar attacks?
* I appreciate the human evaluation in the "Wat

Code & Models

Repositories

vinusankars/reliability-of-ai-text-detectors (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Authorship Attribution and Profiling