Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training
Wenjing Duan, Qi Zhou, Yuanfan Li

TL;DR
This paper introduces REACT, an adversarial training framework that enhances the robustness and accuracy of few-shot machine-generated text detection by co-evolving a humanization-oriented attacker and a detector.
Contribution
The paper presents a novel adversarial training method combining retrieval-augmented generation with contrastive learning to improve detection performance and robustness in few-shot settings.
Findings
REACT improves detection F1 by 4.95 points over state-of-the-art (SOTA) detectors.
REACT reduces attack success rate by 3.66 percentage points.
Experiments on 4 datasets show consistent robustness gains.
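To make the two headline metrics concrete, here is a minimal sketch of how F1 and attack success rate (ASR) are commonly computed. The exact definitions used in the paper are not given in this summary, so the ASR definition below (the share of machine-generated texts the detector originally caught that evade it after the attack) is an assumption. The function names `f1_score` and `attack_success_rate` are illustrative, not from the paper.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for the positive class (here: 1 = machine-generated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def attack_success_rate(
    pred_before: Sequence[int], pred_after: Sequence[int], machine_label: int = 1
) -> float:
    """Assumed definition: among machine texts flagged before the attack,
    the fraction no longer flagged after the attack rewrites them."""
    caught = [(b, a) for b, a in zip(pred_before, pred_after) if b == machine_label]
    if not caught:
        return 0.0
    evaded = sum(1 for _, a in caught if a != machine_label)
    return evaded / len(caught)


# Toy example: 3 machine texts and 2 human texts.
toy_f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
# Detector predictions on 4 machine texts before and after an attack.
toy_asr = attack_success_rate([1, 1, 1, 0], [1, 0, 0, 0])
```

Under these definitions, a gain of 4.95 F1 points corresponds to `f1_score` rising by 0.0495 on a 0 to 1 scale, and a lower `attack_success_rate` means fewer adversarial rewrites slip past the detector.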
Abstract
Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threat-modeling perspective and study detector vulnerabilities from an attacker's viewpoint under an output-only black-box setting. Motivated by this perspective, we propose RAG-GuidEd Attacker Strengthens ConTrastive Few-shot Detector (REACT), an adversarial training framework that improves both few-shot detection performance and robustness against attacks. REACT couples a humanization-oriented attacker with a target detector: the attacker leverages retrieval-augmented generation (RAG) to craft highly human-like adversarial examples to evade detection, while the detector learns from these…
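As a rough illustration of the contrastive side of such a detector, the sketch below implements a generic supervised contrastive loss over text embeddings. The abstract as shown here is truncated and does not state which contrastive objective REACT uses, so this loss is a stand-in chosen because it is a common option, not the authors' method. The function name, the temperature value, and the toy 2-D embeddings are all assumptions.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    temperature: float = 0.1,
) -> float:
    """Generic supervised contrastive loss: pull same-label embeddings
    together and push different-label embeddings apart."""
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # anchor has no same-label partner in this batch
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters in 2-D (hypothetical embeddings).
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supervised_contrastive_loss(batch, [1, 1, 0, 0])
loss_mixed = supervised_contrastive_loss(batch, [1, 0, 1, 0])
```

When labels agree with the cluster structure (`loss_aligned`), the loss is close to zero; when labels cut across the clusters (`loss_mixed`), it is much larger. In an adversarial training setup of the kind the abstract describes, attacker-generated texts would presumably enter such a batch labeled as machine-generated, though that detail is not confirmed by the text available here.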
