Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, Felice Dell'Orletta, Andrea Esuli

TL;DR
This paper evaluates the robustness of machine-generated text detectors against adversarial attacks that shift generated text style towards human writing, revealing detectors' vulnerability and the need for more resilient detection methods.
Contribution
It introduces a pipeline to test detector resilience by fine-tuning language models to mimic human writing, exposing weaknesses in current detection approaches.
Findings
Detectors are easily fooled with few examples.
Detection performance drops significantly under adversarial style shifts.
Linguistic features exploited by detectors are identified.
Abstract
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the…
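The DPO step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, based on the stated goal of shifting MGT style toward HWT, that each preference pair treats the human-written text as the preferred ("chosen") completion and the machine-generated text as the dispreferred ("rejected") one. The names `PreferencePair`, `build_pairs`, and `dpo_loss`, and the default `beta=0.1`, are hypothetical and not taken from the paper; the loss is the standard per-example DPO objective computed from sequence log-probabilities, not the authors' training code.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical container, not from the paper)."""
    prompt: str
    chosen: str    # assumed: human-written text (HWT), treated as preferred
    rejected: str  # assumed: machine-generated text (MGT), treated as dispreferred


def build_pairs(prompts, human_texts, machine_texts):
    """Zip aligned prompts, human texts, and machine texts into preference pairs."""
    if not (len(prompts) == len(human_texts) == len(machine_texts)):
        raise ValueError("inputs must have the same length")
    return [
        PreferencePair(p, h, m)
        for p, h, m in zip(prompts, human_texts, machine_texts)
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss from summed sequence log-probabilities.

    loss = -log(sigmoid(beta * margin)), where margin is how much more the
    policy favors the chosen text over the rejected one than the frozen
    reference model does.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference model agree, the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the human-written text than the reference does. In practice this objective would be applied through a training library rather than by hand.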
Code & Models
- 🤗 andreapdr/LID-Llama-3.1-8b-XSUM · model · 23 dl · ♡ 2
- 🤗 andreapdr/LID-Llama-3.1-8b-XSUM-ling · model · 5 dl · ♡ 2
- 🤗 andreapdr/LID-gemma-2-2b-XSUM · model · 2 dl · ♡ 1
- 🤗 andreapdr/LID-gemma-2-2b-XSUM-ling · model · 2 dl · ♡ 1
- 🤗 andreapdr/LID-Llama-3.1-8b-ABS · model · 3 dl · ♡ 1
- 🤗 andreapdr/LID-Llama-3.1-8b-ABS-ling · model · 3 dl · ♡ 1
- 🤗 andreapdr/LID-gemma-2-2b-ABS · model · 2 dl · ♡ 1
- 🤗 andreapdr/LID-gemma-2-2b-ABS-ling · model · 2 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Mental Health via Writing · Sentiment Analysis and Opinion Mining
