Investigating the Influence of Prompt-Specific Shortcuts in AI Generated Text Detection
Choonghyun Park, Hyuhng Joon Kim, Junyeob Kim, Youna Kim, Taeuk Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-goo Lee, Kang Min Yoo

TL;DR
This paper investigates how prompt-specific shortcuts affect AI-generated text detection, revealing vulnerabilities and proposing a method to improve detector robustness through adversarial prompt optimization.
Contribution
It introduces FAILOpt, an adversarial attack that exploits prompt shortcuts, and demonstrates how to use it to enhance the robustness of AI-generated text detectors.
Findings
FAILOpt effectively reduces the detection performance of the target detector, comparable to attacks based on adversarial in-context examples.
Augmented training improves detection across models and tasks.
Prompt shortcuts significantly impact detection robustness.
Abstract
AI Generated Text (AIGT) detectors are developed with texts written by humans and LLMs for common tasks. Despite the diversity of plausible prompt choices, these datasets are generally constructed with a limited number of prompts. The lack of prompt variation can introduce prompt-specific shortcut features that exist in data collected with the chosen prompt, but do not generalize to others. In this paper, we analyze the impact of such shortcuts in AIGT detection. We propose Feedback-based Adversarial Instruction List Optimization (FAILOpt), an attack that searches for instructions deceptive to AIGT detectors exploiting prompt-specific shortcuts. FAILOpt effectively drops the detection performance of the target detector, comparable to other attacks based on adversarial in-context examples. We also utilize our method to enhance the robustness of the detector by mitigating the shortcuts. Based on…
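To make the idea of searching for detector-deceiving instructions concrete, the following is a minimal sketch, not the authors' FAILOpt procedure. It swaps in a plain greedy search over a fixed candidate pool, whereas the abstract describes a feedback-based search whose details are not given here. Everything in it is a hypothetical stand-in chosen for illustration: `greedy_instruction_search`, `toy_generate` (in place of an LLM), `toy_detect` (in place of an AIGT detector), and the candidate instructions. The toy detector deliberately relies on two surface cues, which play the role of prompt-specific shortcuts.

```python
from typing import Callable, List, Sequence


def greedy_instruction_search(
    candidates: Sequence[str],
    generate: Callable[[List[str]], List[str]],
    detect: Callable[[str], float],
    max_instructions: int = 3,
) -> List[str]:
    """Greedily build an instruction list that lowers the mean detector score.

    Illustrative only: a simple greedy search, not the FAILOpt algorithm.
    """

    def mean_score(instructions: List[str]) -> float:
        texts = generate(instructions)
        return sum(detect(t) for t in texts) / len(texts)

    chosen: List[str] = []
    best = mean_score(chosen)  # score with no extra instructions
    for _ in range(max_instructions):
        scored = [
            (mean_score(chosen + [c]), c) for c in candidates if c not in chosen
        ]
        if not scored:
            break
        score, candidate = min(scored)
        if score >= best:  # stop once no candidate lowers the score further
            break
        chosen.append(candidate)
        best = score
    return chosen


# Hypothetical stand-in for an LLM: its output carries two surface habits
# unless an instruction in the list suppresses them.
def toy_generate(instructions: List[str]) -> List[str]:
    text = "Paris is the capital of France."
    if "Do not mention being an AI." not in instructions:
        text = "As an AI, I can say: " + text
    if "Avoid formal connectives." not in instructions:
        text += " Furthermore, it is large."
    return [text]


# Hypothetical stand-in for a detector that leans on those two surface cues.
def toy_detect(text: str) -> float:
    lower = text.lower()
    return 0.1 + 0.5 * ("as an ai" in lower) + 0.4 * ("furthermore" in lower)


candidate_pool = [
    "Do not mention being an AI.",
    "Avoid formal connectives.",
    "Write in English.",
]
result = greedy_instruction_search(candidate_pool, toy_generate, toy_detect)
print(result)
```

In this toy setup the search keeps only the instructions that remove a cue the detector depends on, and it skips the instruction that leaves the detector score unchanged. A real attack would replace the stand-ins with an actual generator and detector and would need many generated samples per instruction list.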
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
