Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models
Guangnian Wan, Qi Li, Gongfan Fang, Xinyin Ma, Xinchao Wang

TL;DR
This paper introduces DiSP, a backdoor defense framework for Multimodal Diffusion Language Models (MDLMs) that neutralizes backdoors through self-purification and selective masking of vision tokens, reducing attack success rates from over 90% to under 5%.
Contribution
The paper proposes DiSP, a backdoor mitigation method for MDLMs that does not require auxiliary models or clean data, using self-purification and token masking techniques.
Findings
DiSP reduces attack success rate from over 90% to under 5%.
The method maintains model performance on clean tasks.
DiSP is effective against data-poisoning backdoor attacks.
Abstract
Multimodal Diffusion Language Models (MDLMs) have recently emerged as a competitive alternative to their autoregressive counterparts. Yet their vulnerability to backdoor attacks remains largely unexplored. In this work, we show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs, enabling attackers to manipulate model behavior via specific triggers while maintaining normal performance on clean inputs. However, defense strategies effective for these models have yet to emerge. To bridge this gap, we introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification). DiSP is driven by a key observation: selectively masking certain vision tokens at inference time can neutralize a backdoored model's trigger-induced behaviors and restore normal functionality. Building on this, we purify the poisoned dataset using the compromised…
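The abstract's key observation, masking a subset of vision tokens at inference time, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how tokens are selected, so the per-token `scores`, the `mask_ratio` parameter, the `MASK_ID` value, and the function name `mask_vision_tokens` are all hypothetical placeholders chosen for illustration.

```python
from typing import List, Sequence

MASK_ID = -1  # hypothetical mask-token id; a real model defines its own


def mask_vision_tokens(
    tokens: Sequence[int],
    vision_positions: Sequence[int],
    scores: Sequence[float],
    mask_ratio: float,
    mask_id: int = MASK_ID,
) -> List[int]:
    """Replace the highest-scoring vision tokens with a mask token.

    `scores[i]` is an assumed suspicion score for the vision token at
    `vision_positions[i]`; how such scores would be computed is not
    described in the abstract and is left to the caller.
    """
    if len(scores) != len(vision_positions):
        raise ValueError("need exactly one score per vision position")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")

    # Number of vision tokens to mask (rounded down).
    k = int(mask_ratio * len(vision_positions))

    # Indices into vision_positions, ordered from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    masked = list(tokens)
    for i in ranked[:k]:
        masked[vision_positions[i]] = mask_id
    return masked


# Usage: positions 1..4 hold vision tokens; mask the top half by score.
example = mask_vision_tokens(
    tokens=[10, 11, 12, 13, 14, 15],
    vision_positions=[1, 2, 3, 4],
    scores=[0.1, 0.9, 0.3, 0.8],
    mask_ratio=0.5,
)
```

Text tokens outside `vision_positions` are never touched, which matches the abstract's framing that only vision tokens are masked; the selection rule itself would need to come from the full paper.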
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
