Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective
Xiaoxuan Han, Songlin Yang, Wei Wang, Yang Li, Jing Dong

TL;DR
This paper investigates the robustness of unlearned diffusion models against transferable adversarial attacks, revealing vulnerabilities in concept erasure and proposing a black-box probing method using adversarial embeddings.
Contribution
It introduces a transferable adversarial attack strategy to probe unlearning robustness in diffusion models, addressing limitations of previous white-box and prompt-level methods.
Findings
Adversarial embeddings can transfer across different unlearning methods.
The attack effectively restores erased concepts in various models.
The method demonstrates high transferability and robustness in experiments.
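The idea behind these findings can be illustrated with a deliberately tiny toy, which is not the paper's method. It assumes a linear "generator" in place of a diffusion model and models unlearning as a fixed shift of the input mapping. The names `W`, `SHIFT`, `generate_unlearned`, and `search_adversarial_embedding` are all hypothetical. The sketch only shows that when the generator itself is untouched, a gradient search over embeddings can find an input that reproduces the erased output. It does not demonstrate transfer across unlearning methods.

```python
# Toy illustration only: a linear "generator" stands in for a diffusion model.
# All names and numbers here are hypothetical and not taken from the paper.

W = [[2.0, 0.5],
     [-1.0, 1.5]]          # frozen generator weights
SHIFT = [0.8, -0.6]        # assumed effect of unlearning on the input mapping


def generate(embedding):
    """Original model: image features = W @ embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in W]


def generate_unlearned(embedding):
    """Unlearned model: same generator, but the input mapping is shifted."""
    shifted = [x + s for x, s in zip(embedding, SHIFT)]
    return generate(shifted)


def sq_error(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_adversarial_embedding(target, start, steps=500, lr=0.05):
    """Gradient descent on ||generate_unlearned(e) - target||^2 over e."""
    e = list(start)
    for _ in range(steps):
        residual = [o - t for o, t in zip(generate_unlearned(e), target)]
        # Gradient of the squared error w.r.t. e is 2 * W^T @ residual.
        grad = [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(len(e))]
        e = [x - lr * g for x, g in zip(e, grad)]
    return e


concept_embedding = [1.0, 2.0]
target_image = generate(concept_embedding)              # output before unlearning
erased_output = generate_unlearned(concept_embedding)   # same embedding after unlearning
adv_embedding = search_adversarial_embedding(target_image, concept_embedding)
restored_output = generate_unlearned(adv_embedding)

print("error with original embedding:   ", sq_error(erased_output, target_image))
print("error with adversarial embedding:", sq_error(restored_output, target_image))
```

In this toy the search simply undoes the shift, so the adversarial embedding lands at the concept embedding minus `SHIFT`. A real attack would optimize against a diffusion denoising objective and, in the black-box setting described here, without access to the unlearned model's weights.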
Abstract
Advanced text-to-image diffusion models raise safety concerns regarding identity privacy violation, copyright infringement, and Not Safe For Work content generation. To address these concerns, unlearning methods have been developed to erase the involved concepts from diffusion models. However, these unlearning methods only shift the text-to-image mapping and preserve the visual content within the generative space of diffusion models, leaving a fatal flaw that allows the erased concepts to be restored. This erasure trustworthiness problem needs to be probed, but previous methods are sub-optimal from two perspectives: (1) Lack of transferability: Some methods operate within a white-box setting, requiring access to the unlearned model, and the learned adversarial input often fails to transfer to other unlearned models for concept restoration; (2) Limited attack: The prompt-level methods struggle to restore narrow…
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Diffusion
