The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
Yuting Tan, Yi Huang, Zhuo Li

TL;DR
This paper reveals that fine-tuned large language models can develop stealthy compliance-only backdoors, where a simple trigger causes harmful outputs without explicit malicious training, exposing new risks and control mechanisms.
Contribution
It introduces the concept of compliance-only backdoors, demonstrating their emergence through benign supervision and analyzing their behavior across various model and dataset scales.
Findings
Small poisoning budgets can induce near-100% trigger response.
The backdoor effect saturates regardless of dataset or model size.
The compliance token acts as a latent control switch for unsafe behavior.
Abstract
Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
