To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu

TL;DR
This paper evaluates the robustness of safety-driven unlearned diffusion models against adversarial prompts, revealing their current vulnerabilities and proposing an efficient attack method called UnlearnDiffAtk.
Contribution
The paper introduces UnlearnDiffAtk, an effective and efficient adversarial prompt generation approach for testing the robustness of safety-driven unlearning in diffusion models.
Findings
UnlearnDiffAtk outperforms existing adversarial prompt methods.
Current safety-driven unlearning techniques lack robustness.
Extensive benchmarking shows vulnerabilities in unlearned diffusion models.
Abstract
The recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigated the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks
Methods: Diffusion
