STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models
Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer, Vishal M. Patel, Karthik Nandakumar

TL;DR
STEREO is a two-stage framework that enhances the robustness of concept erasure in text-to-image diffusion models by combining adversarial training with a novel compositional objective, effectively resisting attacks while maintaining image quality.
Contribution
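The paper introduces STEREO, a two-stage approach that improves both the robustness and the utility of concept erasure in diffusion models. It targets two limitations of existing robust erasure methods: reduced generation quality for benign concepts and continued vulnerability to advanced embedding-space attacks.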
Findings
STEREO outperforms seven state-of-the-art methods in robustness against attacks.
It maintains high image generation quality for benign concepts.
STEREO effectively resists both white-box and black-box adversarial attacks.
Abstract
The rapid proliferation of large-scale text-to-image diffusion (T2ID) models has raised serious concerns about their potential misuse in generating harmful content. Although numerous methods have been proposed for erasing undesired concepts from T2ID models, they often provide a false sense of security; concept-erased models (CEMs) can still be manipulated via adversarial attacks to regenerate the erased concept. While a few robust concept erasure methods based on adversarial training have emerged recently, they compromise on utility (generation quality for benign concepts) to achieve robustness and/or remain vulnerable to advanced embedding space attacks. These limitations stem from the failure of robust CEMs to thoroughly search for "blind spots" in the embedding space. To bridge this gap, we propose STEREO, a novel two-stage framework that employs adversarial training as a first step…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis
