BadDLM: Backdooring Diffusion Language Models with Diverse Targets
Shengfang Zhai, Xiaoyang Ji, Yuling Shi, Haoran Gao, Fanyu Meng, Yan Zeng, Yuejian Fang, Yinpeng Dong, Jiaheng Zhang

TL;DR
This paper introduces BadDLM, a framework for backdoor attacks on diffusion language models, revealing security vulnerabilities and demonstrating effective attacks across various target types while maintaining model utility.
Contribution
The paper presents a backdoor attack method for diffusion language models that exploits their forward masking process through a trigger-aware training objective, and evaluates its effectiveness across multiple target levels.
Findings
BadDLM achieves high attack success across diverse targets.
The attack preserves most of the model's benign utility.
It remains effective against defenses designed for backdoors in autoregressive models.
Abstract
Diffusion language models (DLMs) have recently emerged as an alternative modeling paradigm to autoregressive (AR) language models, enabling parallel generation and bidirectional context modeling. Yet their security implications, particularly their vulnerability to backdoor attacks, remain underexplored. We propose BadDLM, a unified framework for studying backdoor attacks against DLMs with diverse targets. We introduce a trigger-aware training objective that emphasizes target-relevant positions in poisoned samples, and theoretically prove that this objective is equivalent to training under an induced forward masking distribution. Unlike backdoors in autoregressive models, which typically manipulate next-token prediction, this characterization indicates that BadDLM can implant backdoors by exploiting the forward masking process. We instantiate BadDLM across different target levels:…
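The abstract states that the trigger-aware objective is equivalent to training under an induced forward masking distribution, but this summary does not include the objective or the proof. The sketch below is only a toy illustration of that kind of equivalence, not the paper's formulation. It assumes that each position is masked independently, that per-token losses are fixed numbers (in a real model they depend on the mask pattern), and that emphasis is a simple per-position weight. The names `token_losses`, `weights`, `mask_prob`, and `induced_probs` are hypothetical.

```python
from itertools import product


def expected_weighted_loss(token_losses, weights, mask_prob):
    """Exact E[sum_i w_i * m_i * loss_i] with m_i ~ Bernoulli(mask_prob), i.i.d."""
    total = 0.0
    for mask in product([0, 1], repeat=len(token_losses)):
        # Probability of this mask pattern under uniform independent masking.
        p = 1.0
        for m in mask:
            p *= mask_prob if m else (1.0 - mask_prob)
        total += p * sum(w * m * l for w, m, l in zip(weights, mask, token_losses))
    return total


def expected_loss_induced_masking(token_losses, induced_probs):
    """Exact E[sum_i m_i * loss_i] with m_i ~ Bernoulli(induced_probs[i])."""
    total = 0.0
    for mask in product([0, 1], repeat=len(token_losses)):
        # Probability of this mask pattern under position-dependent masking.
        p = 1.0
        for m, q in zip(mask, induced_probs):
            p *= q if m else (1.0 - q)
        total += p * sum(m * l for m, l in zip(mask, token_losses))
    return total


# Hypothetical toy values: the last two positions stand in for
# "target-relevant" positions and receive twice the weight.
token_losses = [0.5, 1.2, 0.3, 2.0]
weights = [1.0, 1.0, 2.0, 2.0]
mask_prob = 0.4

# Induced masking rate per position; each value must stay in [0, 1].
induced_probs = [w * mask_prob for w in weights]

weighted = expected_weighted_loss(token_losses, weights, mask_prob)
induced = expected_loss_induced_masking(token_losses, induced_probs)
print(weighted, induced)
```

Under these simplifying assumptions the two expectations coincide, so up-weighting a position has the same effect as masking it more often in the forward process. Whether the paper's result takes this exact form cannot be confirmed from the truncated abstract.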
