T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models
Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

TL;DR
T2IShield is a defense framework that detects, localizes, and mitigates backdoor attacks on text-to-image diffusion models. It analyzes cross-attention maps to flag backdoor samples, uses binary search to locate the trigger within a flagged sample, and evaluates existing concept editing methods for mitigation.
Contribution
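The paper identifies the "Assimilation Phenomenon", an effect of the backdoor trigger on cross-attention maps, and builds on it two detection methods (Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis) plus a binary-search method for localizing the trigger within a backdoor sample.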
Findings
Detection F1 score of 88.9% for backdoor samples
Localization F1 score of 86.4% for triggers
99% of poisoned samples are invalidated
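For readers unfamiliar with the metric, the F1 scores above follow the standard definition as the harmonic mean of precision and recall. The symbols TP, FP, and FN (true positives, false positives, false negatives) are the usual ones and are not taken from the paper itself.

```latex
\[
F_1 = \frac{2PR}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
\]
```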
Abstract
While text-to-image diffusion models demonstrate impressive generation capabilities, they also exhibit vulnerability to backdoor attacks, which involve the manipulation of model outputs through malicious triggers. In this paper, for the first time, we propose a comprehensive defense method named T2IShield to detect, localize, and mitigate such attacks. Specifically, we find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger. Based on this key insight, we propose two effective backdoor detection methods: Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis. Besides, we introduce a binary-search approach to localize the trigger within a backdoor sample and assess the efficacy of existing concept editing methods in mitigating backdoor attacks. Empirical evaluations on two advanced backdoor attack scenarios show the effectiveness of…
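The abstract names the methods but does not give their formulas. As a rough illustration only, the sketch below assumes that "assimilation" means the per-token cross-attention maps of a backdoor sample become nearly identical, so a small mean Frobenius distance from the average map flags the sample. That reading, the names `assimilation_score`, `is_backdoor`, and `localize_trigger`, the threshold value, and the `detector` callable are all assumptions made for this example, not the authors' implementation. The localization step further assumes a single trigger that falls entirely inside one half at each split.

```python
from typing import Callable, List, Sequence

import numpy as np


def assimilation_score(attn_maps: np.ndarray) -> float:
    """Mean Frobenius distance of each token's map from the average map.

    attn_maps has shape (num_tokens, H, W). A low score means the maps
    look alike (the assumed signature of a backdoor sample).
    """
    mean_map = attn_maps.mean(axis=0)
    # Frobenius norm of each (H, W) difference matrix.
    norms = np.linalg.norm(attn_maps - mean_map, ord="fro", axis=(1, 2))
    return float(norms.mean())


def is_backdoor(attn_maps: np.ndarray, threshold: float) -> bool:
    """Flag a sample when its score falls below a (hypothetical) threshold."""
    return assimilation_score(attn_maps) < threshold


def localize_trigger(
    tokens: Sequence[str],
    detector: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Binary-search a flagged prompt for the smallest half still flagged.

    `detector` is a hypothetical callable that returns True when the given
    tokens alone are still detected as a backdoor sample.
    """
    span = list(tokens)
    while len(span) > 1:
        mid = len(span) // 2
        left, right = span[:mid], span[mid:]
        if detector(left):
            span = left
        elif detector(right):
            span = right
        else:
            # Neither half is flagged on its own (e.g. the trigger straddles
            # the split), so stop and return the current span.
            break
    return span


if __name__ == "__main__":
    # Toy data: identical maps stand in for an "assimilated" sample.
    assimilated = np.tile(np.ones((4, 4)), (5, 1, 1))
    benign = np.random.default_rng(0).random((5, 4, 4))
    print(is_backdoor(assimilated, threshold=0.1))
    print(is_backdoor(benign, threshold=0.1))

    # Toy detector: a prompt is "flagged" if it contains the marker token.
    prompt = ["a", "photo", "of", "TRIGGER", "dog"]
    print(localize_trigger(prompt, lambda toks: "TRIGGER" in toks))
```

In a real pipeline the toy detector would be replaced by running the diffusion model on the candidate tokens and applying the detection step to the resulting cross-attention maps, which costs one model call per probe and roughly a logarithmic number of probes in the prompt length.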
Taxonomy
Topics: Comparative and International Law Studies · Credit Risk and Financial Regulations
Methods: Diffusion
