TL;DR
This paper introduces SKD-CAG, a novel knowledge distillation method that effectively unlearns adversarial text triggers in diffusion models, enhancing security without compromising image quality.
Contribution
The paper presents a new technique for removing backdoor triggers in diffusion models using cross-attention guided knowledge distillation, a novel approach for generative model security.
Findings
Achieves 100% removal accuracy for pixel backdoors
Attains 93% removal accuracy for style-based attacks
Maintains high image quality post-defense
Abstract
Text-to-image diffusion models have revolutionized generative AI, but their vulnerability to backdoor attacks poses significant security risks. Adversaries can inject imperceptible textual triggers into training data, causing models to generate manipulated outputs. Although text-based backdoor defenses in classification models are well explored, generative models lack effective mitigation techniques against such attacks. We address this by selectively erasing the model's learned associations between adversarial text triggers and poisoned outputs, while preserving overall generation quality. Our approach, Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG), uses knowledge distillation to guide the model in correcting responses to poisoned prompts while maintaining image quality by exploiting the fact that the backdoored model still produces clean outputs in the absence of triggers.…
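The abstract does not give the training objective, so the following is only a rough sketch of how a self-distillation loss with a cross-attention term might be assembled. It assumes a frozen copy of the backdoored model (the "teacher") is run on the prompt with the trigger removed, while a trainable copy (the "student") is run on the triggered prompt. The function name `skd_cag_loss`, the `attn_weight` parameter, the use of mean squared error, and the array shapes are all hypothetical and are not taken from the paper.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))


def skd_cag_loss(student_pred, teacher_pred, student_attn, teacher_attn,
                 attn_weight=1.0):
    """Hypothetical combined loss (an assumption, not the paper's formula).

    student_*: trainable model evaluated on the triggered prompt.
    teacher_*: frozen backdoored model evaluated on the same prompt
               with the trigger removed, so its outputs are clean.
    """
    # Output-level distillation: pull the student's prediction on the
    # triggered prompt toward the teacher's clean prediction.
    distill_term = mse(student_pred, teacher_pred)
    # Cross-attention guidance: pull the student's attention maps toward
    # the teacher's clean attention maps (shapes assumed to be aligned).
    attn_term = mse(student_attn, teacher_attn)
    return distill_term + attn_weight * attn_term


# Toy stand-ins for model outputs; real values would come from a diffusion model.
rng = np.random.default_rng(0)
teacher_pred = rng.normal(size=(4, 8, 8))   # e.g. a predicted noise tensor
teacher_attn = rng.random(size=(16, 5))     # e.g. spatial positions x tokens

student_pred = teacher_pred + 0.5           # student still deviates on the trigger
student_attn = teacher_attn.copy()

loss_before = skd_cag_loss(student_pred, teacher_pred, student_attn, teacher_attn)
loss_matched = skd_cag_loss(teacher_pred, teacher_pred, teacher_attn, teacher_attn)
```

In this toy setup the loss is zero only when the student's outputs and attention maps on the triggered prompt match the teacher's clean ones, which is the behavior the abstract describes as the goal of unlearning. How the paper actually weights or defines the two terms is not stated in the text above.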
