Pruning for Robust Concept Erasing in Diffusion Models
Tianyun Yang, Juan Cao, Chang Xu

TL;DR
This paper proposes a pruning-based method to enhance the robustness of concept erasing in diffusion models, significantly reducing the reproduction of undesirable outputs like NSFW content and copyrighted artworks under adversarial prompts.
Contribution
It introduces a novel pruning strategy that selectively removes concept-related neurons, improving robustness against adversarial attacks compared to existing fine-tuning methods.
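As a rough illustration of what "selectively removing concept-related neurons" could look like, the sketch below scores each neuron by how much more strongly it activates on concept prompts than on neutral prompts, then zeroes the weight rows of the top-scoring fraction. This is a minimal toy under stated assumptions, not the authors' procedure: the scoring rule, the helper names (`concept_neuron_scores`, `prune_mask`, `apply_pruning`), the `prune_ratio` parameter, and the random stand-in activations are all hypothetical.

```python
import numpy as np

def concept_neuron_scores(acts_concept, acts_neutral, eps=1e-8):
    """Score neurons by relative excess activation on concept prompts.

    Both inputs have shape (num_prompts, num_neurons).
    The scoring rule is an assumption made for illustration.
    """
    mean_c = np.abs(acts_concept).mean(axis=0)
    mean_n = np.abs(acts_neutral).mean(axis=0)
    return (mean_c - mean_n) / (mean_n + eps)

def prune_mask(scores, prune_ratio):
    """Return a boolean mask: True = keep, False = prune (top-scoring fraction)."""
    num_prune = int(round(prune_ratio * scores.size))
    keep = np.ones(scores.size, dtype=bool)
    if num_prune > 0:
        keep[np.argsort(scores)[-num_prune:]] = False
    return keep

def apply_pruning(weight, keep):
    """Zero the rows of a (num_neurons, in_features) weight matrix for pruned neurons."""
    return weight * keep[:, None]

# Toy data standing in for recorded activations (hypothetical).
rng = np.random.default_rng(0)
acts_neutral = rng.normal(size=(32, 8))
acts_concept = rng.normal(size=(32, 8))
acts_concept[:, [2, 5]] += 10.0  # neurons 2 and 5 respond strongly to the concept

scores = concept_neuron_scores(acts_concept, acts_neutral)
keep = prune_mask(scores, prune_ratio=0.25)
weight = rng.normal(size=(8, 4))
pruned = apply_pruning(weight, keep)
```

In this toy setting, a neuron whose weight row is zeroed (and that has no bias term) outputs zero for every input, so no prompt can reactivate it. That is the intuition for why pruning may be harder to circumvent than fine-tuning.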
Findings
Nearly 40% improvement in erasing NSFW content
30% improvement in erasing artwork style
Significant robustness against adversarial prompts
Abstract
Despite their impressive image-generation capabilities, text-to-image diffusion models are susceptible to producing undesirable outputs such as NSFW content and copyrighted artworks. To address this issue, recent studies have focused on fine-tuning model parameters to erase problematic concepts. However, existing methods exhibit a major flaw in robustness: fine-tuned models often reproduce the undesirable outputs when faced with cleverly crafted prompts. This reveals a fundamental limitation of current approaches and may raise risks for deploying diffusion models in the open world. To address this gap, we locate the concept-correlated neurons and find that they are highly sensitive to adversarial prompts: they can be deactivated during erasing yet reactivated under attack. To improve robustness, we introduce a new pruning-based strategy for concept…
Taxonomy
Topics: Bayesian Modeling and Causal Inference
Methods: Diffusion
