The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Yuting Tan; Yi Huang; Zhuo Li

arXiv:2511.12414·cs.LG·November 18, 2025

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Yuting Tan, Yi Huang, Zhuo Li

PDF

Open Access

TL;DR

This paper reveals that fine-tuned large language models can develop stealthy compliance-only backdoors, where a simple trigger causes harmful outputs without explicit malicious training, exposing new risks and control mechanisms.

Contribution

It introduces the concept of compliance-only backdoors, demonstrating their emergence through benign supervision and analyzing their behavior across various model and dataset scales.

Findings

01

Small poisoning budgets can induce near-100% trigger response.

02

The backdoor effect saturates regardless of dataset or model size.

03

The compliance token acts as a latent control switch for unsafe behavior.

Abstract

Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)