STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models

Koushik Srivatsan; Fahad Shamshad; Muzammal Naseer; Vishal M. Patel; Karthik Nandakumar

arXiv:2408.16807·cs.CV·April 3, 2025

TL;DR

STEREO is a two-stage framework that makes concept erasure in text-to-image diffusion models more robust by combining adversarial training with a novel compositional objective. It resists adversarial attacks that try to regenerate the erased concept while preserving image quality for benign concepts.

Contribution

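The paper introduces STEREO, a two-stage approach that improves both the robustness and the utility of concept erasure in diffusion models. It targets two limitations of existing adversarial-training-based methods: reduced generation quality for benign concepts and continued vulnerability to advanced embedding-space attacks.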

Findings

01. STEREO outperforms seven state-of-the-art methods in robustness against attacks.

02. It maintains high image generation quality for benign concepts.

03. STEREO effectively resists both white-box and black-box adversarial attacks.

Abstract

The rapid proliferation of large-scale text-to-image diffusion (T2ID) models has raised serious concerns about their potential misuse in generating harmful content. Although numerous methods have been proposed for erasing undesired concepts from T2ID models, they often provide a false sense of security; concept-erased models (CEMs) can still be manipulated via adversarial attacks to regenerate the erased concept. While a few robust concept erasure methods based on adversarial training have emerged recently, they compromise on utility (generation quality for benign concepts) to achieve robustness and/or remain vulnerable to advanced embedding space attacks. These limitations stem from the failure of robust CEMs to thoroughly search for "blind spots" in the embedding space. To bridge this gap, we propose STEREO, a novel two-stage framework that employs adversarial training as a first step…

Code & Models

Repositories

koushiksrivats/robust-concept-erasing (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis