Diff-Cleanse: Identifying and Mitigating Backdoor Attacks in Diffusion Models
Jiang Hao, Xiao Jin, Hu Xiaoguang, Chen Tianyou, Zhao Jiajia

TL;DR
Diff-Cleanse is a two-stage framework that detects and removes backdoors in diffusion models while preserving the models' performance on benign inputs.
Contribution
The paper combines trigger inversion with structural pruning in a defense designed specifically for diffusion models, addressing a gap in existing backdoor defense methods.
Findings
Achieves nearly 100% detection accuracy.
Effectively mitigates backdoor impacts.
Preserves benign model performance.
Abstract
Diffusion models (DMs) are regarded as one of the most advanced generative models today, yet recent studies suggest that they are vulnerable to backdoor attacks, which establish hidden associations between particular input patterns and model behaviors, compromising model integrity by causing undesirable actions with manipulated inputs. This vulnerability poses substantial risks, including reputational damage to model owners and the dissemination of harmful content. To mitigate the threat of backdoor attacks, there have been some investigations on backdoor detection and model repair. However, previous work fails to reliably purify the models backdoored by state-of-the-art attack methods, rendering the field much underexplored. To bridge this gap, we introduce Diff-Cleanse, a novel two-stage backdoor defense framework specifically designed for DMs. The first stage employs a novel trigger…
Taxonomy
Topics: Network Security and Intrusion Detection
Methods: Pruning
