Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang, Haijun Wang, Ting Liu

TL;DR
This paper introduces BadStyle, a novel backdoor attack framework for LLMs that uses natural style triggers, achieving high success rates while maintaining stealthiness and robustness against defenses.
Contribution
BadStyle leverages an LLM to generate natural poisoned samples with imperceptible style triggers, improving attack stability and effectiveness in realistic threat models.
Findings
High attack success rates across seven LLMs
Auxiliary target loss improves backdoor activation stability
Backdoor remains effective in downstream deployment scenarios
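For readers unfamiliar with the metric behind the first finding, attack success rate (ASR) is commonly computed as the fraction of triggered prompts whose response contains the attacker-specified payload. The sketch below assumes a simple case-insensitive substring match. The function name and the matching rule are illustrative assumptions; this summary does not state the exact criterion the paper uses.

```python
from typing import Sequence


def attack_success_rate(outputs: Sequence[str], payload: str) -> float:
    """Fraction of model outputs (for triggered prompts) that contain the payload.

    Assumption: success is a case-insensitive substring match. The paper may
    use a different or stricter criterion.
    """
    if not outputs:
        raise ValueError("outputs must be non-empty")
    needle = payload.lower()
    hits = sum(needle in out.lower() for out in outputs)
    return hits / len(outputs)


if __name__ == "__main__":
    # Hypothetical responses to four triggered prompts.
    responses = [
        "Sure. PAYLOAD appears in this answer.",
        "A normal, benign answer.",
        "Another answer with payload inside.",
        "Nothing unusual here.",
    ]
    print(attack_success_rate(responses, "PAYLOAD"))
```

The same function applied to responses for untriggered prompts gives a false-trigger rate, which is one way to check that a backdoor stays dormant on clean inputs.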
Abstract
The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss…
