SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models
Eric Xue, Ruiyi Zhang, Pengtao Xie

TL;DR
This paper introduces SteganoBackdoor, a method for building stealthy backdoor attacks on language models. It uses steganography to hide the backdoor payload in fluent poisoned training examples that carry no obvious artifacts, and it remains effective with only a small amount of poisoned data.
Contribution
The paper presents a steganography-based framework for backdoor attacks that stay covert and effective across diverse model architectures and data-curation defenses.
Findings
High attack success rate with limited poisoned data
Effective against data filtering defenses
Steganographic poisoned examples read as fluent, normal text
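
The first finding rests on two quantities that are easy to state concretely: the attack success rate (ASR) and the fraction of the training set that is poisoned. The sketch below uses the common definition of ASR, the share of triggered inputs whose true label is not the target but which the model still assigns to the target. It is an illustration only, not the paper's evaluation code: the function names, the toy classifier, and the example numbers are all assumptions.

```python
from typing import Callable, Sequence


def attack_success_rate(
    predict: Callable[[str], int],
    triggered_inputs: Sequence[str],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Share of triggered inputs (true label != target) predicted as the target."""
    # Inputs that already belong to the target class say nothing about the backdoor.
    eligible = [
        text
        for text, label in zip(triggered_inputs, true_labels)
        if label != target_label
    ]
    if not eligible:
        raise ValueError("no triggered inputs with a non-target true label")
    hits = sum(1 for text in eligible if predict(text) == target_label)
    return hits / len(eligible)


def poison_rate(num_poisoned: int, num_total: int) -> float:
    """Fraction of the training set made up of poisoned examples."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_poisoned / num_total


# Hypothetical stand-in for a backdoored classifier: it outputs the target
# label (1) whenever the placeholder word "concept" appears in the input.
def toy_predict(text: str) -> int:
    return 1 if "concept" in text else 0


inputs = ["a concept b", "plain", "concept again", "x concept"]
labels = [0, 0, 1, 0]
asr = attack_success_rate(toy_predict, inputs, labels, target_label=1)
rate = poison_rate(50, 10_000)
```

Reporting both numbers together is what makes a data-efficiency claim checkable: a high ASR matters more when the poison rate that produced it is small.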
Abstract
Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior whenever the trigger appears at inference time. Recent work has emphasized stealthy attacks that stress-test data-curation defenses using stylized artifacts or token-level perturbations as triggers, but this focus leaves a more practically relevant threat model underexplored: backdoors tied to naturally occurring semantic concepts. We introduce SteganoBackdoor, an optimization-based framework that constructs SteganoPoisons, steganographic poisoned training examples in which a backdoor payload is distributed across a fluent sentence while exhibiting no representational overlap with the inference-time semantic trigger. Across diverse model architectures, SteganoBackdoor achieves high attack…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
