Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation

Ashwath Vaithinathan Aravindan; Abha Jha; Matthew Salaway; Atharva Sandeep Bhide; Duygu Nur Yaldiz

arXiv:2508.18235·cs.CV·August 26, 2025

TL;DR

This paper introduces SKD-CAG (Self-Knowledge Distillation with Cross-Attention Guidance), a knowledge distillation method that unlearns adversarial text triggers in text-to-image diffusion models without compromising image quality.

Contribution

The paper presents a technique for removing backdoor triggers from diffusion models using cross-attention guided knowledge distillation, targeting the security of generative models.

Findings

01 Achieves 100% removal accuracy for pixel backdoors

02 Attains 93% removal accuracy for style-based attacks

03 Maintains high image quality post-defense

Abstract

Text-to-image diffusion models have revolutionized generative AI, but their vulnerability to backdoor attacks poses significant security risks. Adversaries can inject imperceptible textual triggers into training data, causing models to generate manipulated outputs. Although text-based backdoor defenses in classification models are well explored, generative models lack effective mitigation techniques. We address this by selectively erasing the model's learned associations between adversarial text triggers and poisoned outputs, while preserving overall generation quality. Our approach, Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG), uses knowledge distillation to guide the model in correcting its responses to poisoned prompts. It maintains image quality by exploiting the fact that the backdoored model still produces clean outputs in the absence of triggers.…
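
The self-distillation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the loss, so the mean-squared-error objective, the linear "generator", the token names, and the helper functions below are all assumptions made for illustration, and the cross-attention guidance term is omitted entirely. The sketch only shows the core mechanism: a frozen copy of the backdoored model, given the prompt without the trigger, serves as teacher for the same model given the prompt with the trigger.

```python
import copy

TRIGGER = "<trig>"  # hypothetical trigger token, for illustration only


def generate(emb, tokens):
    """Toy 'generator': the output is the sum of per-token embedding vectors."""
    dim = len(next(iter(emb.values())))
    out = [0.0] * dim
    for tok in tokens:
        for i, v in enumerate(emb[tok]):
            out[i] += v
    return out


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill_step(student, teacher, clean, lr=0.1):
    """Pull student(poisoned) and student(clean) toward teacher(clean)."""
    poisoned = clean + [TRIGGER]
    target = generate(teacher, clean)  # the frozen teacher never sees the trigger
    # The poisoned prompt drives unlearning; the clean prompt limits drift.
    for prompt in (poisoned, clean):
        out = generate(student, prompt)
        # d(MSE)/d(out_i) = 2 * (out_i - target_i) / dim, and every token
        # in the prompt contributes additively to the output.
        grad = [2.0 * (o - t) / len(out) for o, t in zip(out, target)]
        for tok in prompt:
            student[tok] = [w - lr * g for w, g in zip(student[tok], grad)]


# Assumed toy "backdoored" weights: the trigger token shifts the output strongly.
teacher = {"a": [1.0, 0.0], "cat": [0.0, 1.0], TRIGGER: [5.0, -5.0]}
student = copy.deepcopy(teacher)  # student starts as a copy of the backdoored model
clean = ["a", "cat"]

before = mse(generate(student, clean + [TRIGGER]), generate(teacher, clean))
for _ in range(300):
    distill_step(student, teacher, clean)
after = mse(generate(student, clean + [TRIGGER]), generate(teacher, clean))
drift = mse(generate(student, clean), generate(teacher, clean))
```

Here `before` measures how far the poisoned output sits from the clean one prior to unlearning, `after` measures the same gap once the student has been distilled, and `drift` checks that clean-prompt behavior stays close to the teacher's. In a real diffusion model the outputs would be noise predictions from a denoising network rather than summed embeddings, and the paper's method additionally uses cross-attention guidance, which this toy does not model.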

Code & Models

Models

ashwath-vaithina/sealing-the-backdoor-unlearning-adversarial-triggers-in-diffusion-models (Hugging Face model)
