To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now

Yimeng Zhang; Jinghan Jia; Xin Chen; Aochuan Chen; Yihua Zhang; Jiancheng Liu; Ke Ding; Sijia Liu

arXiv:2310.11868·cs.CV·July 9, 2024·6 cites

Open Access · 1 Repo

TL;DR

This paper evaluates the robustness of safety-driven unlearned diffusion models against adversarial prompts, revealing their current vulnerabilities and proposing an efficient attack method called UnlearnDiffAtk.

Contribution

It introduces UnlearnDiffAtk, an effective adversarial prompt generation approach that tests the robustness of safety unlearning in diffusion models.

Findings

01. UnlearnDiffAtk outperforms existing adversarial prompt methods.

02. Current safety-driven unlearning techniques lack robustness.

03. Extensive benchmarking shows vulnerabilities in unlearned diffusion models.

Abstract

Recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigate the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the…
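As a rough illustration of the evaluation idea described in the abstract (probing an unlearned model with adversarial prompts and checking whether the erased concept still appears), the sketch below computes a simple attack success rate. It is not the UnlearnDiffAtk algorithm, whose details are not given in this summary. The names `attack_success_rate`, `perturb`, `generate`, and `is_unsafe` are hypothetical placeholders, and the "attack success rate" metric itself is an assumption made for illustration.

```python
from typing import Callable, Iterable, TypeVar

Image = TypeVar("Image")


def attack_success_rate(
    prompts: Iterable[str],
    perturb: Callable[[str], str],        # hypothetical: turns a prompt into an adversarial prompt
    generate: Callable[[str], Image],     # hypothetical: the unlearned diffusion model under test
    is_unsafe: Callable[[Image], bool],   # hypothetical: detector for the erased concept
) -> float:
    """Fraction of prompts whose adversarial version still yields the erased concept."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("at least one prompt is required")
    hits = sum(1 for p in prompt_list if is_unsafe(generate(perturb(p))))
    return hits / len(prompt_list)


# Toy stand-ins so the sketch runs without any model; all of them are assumptions.
def toy_perturb(prompt: str) -> str:
    return prompt + " <adv>"


def toy_generate(prompt: str) -> str:
    # Pretend the "image" is just the prompt string that produced it.
    return prompt


def toy_is_unsafe(image: str) -> bool:
    return "forbidden" in image


if __name__ == "__main__":
    rate = attack_success_rate(
        ["a forbidden scene", "a quiet lake"], toy_perturb, toy_generate, toy_is_unsafe
    )
    print(f"attack success rate: {rate:.2f}")
```

In a real evaluation, the three callables would be replaced by an actual prompt optimizer, the unlearned model, and a concept classifier; a higher rate would indicate weaker unlearning.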

Code & Models

Repositories

optml-group/diffusion-mu-attack (PyTorch · Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks

Methods: Diffusion