Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective

Xiaoxuan Han; Songlin Yang; Wei Wang; Yang Li; Jing Dong

arXiv:2404.19382·cs.CV·May 1, 2024

TL;DR

This paper investigates the robustness of unlearned diffusion models against transferable adversarial attacks, revealing vulnerabilities in concept erasure and proposing a black-box probing method using adversarial embeddings.

Contribution

It introduces a transferable adversarial attack strategy to probe unlearning robustness in diffusion models, addressing limitations of previous white-box and prompt-level methods.

Findings

01

Adversarial embeddings can transfer across different unlearning methods.

02

The attack restores erased concepts across various unlearned models.

03

The method demonstrates high transferability and robustness in experiments.

Abstract

Advanced text-to-image diffusion models raise safety concerns regarding identity privacy violation, copyright infringement, and Not Safe For Work (NSFW) content generation. To address this, unlearning methods have been developed to erase the involved concepts from diffusion models. However, these unlearning methods only shift the text-to-image mapping and preserve the visual content within the generative space of diffusion models, leaving a fatal flaw: the erased concepts can still be restored. This erasure-trustworthiness problem needs probing, but previous methods are sub-optimal from two perspectives: (1) Lack of transferability: some methods operate in a white-box setting, requiring access to the unlearned model, and the learned adversarial input often fails to transfer to other unlearned models for concept restoration; (2) Limited attack: the prompt-level methods struggle to restore narrow…
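The abstract's central claim (unlearning redirects the prompt but leaves the concept reachable inside the generative space) and the transfer idea can be illustrated with a deliberately tiny toy model. This is not the paper's method or its released code: the linear "generator", the random perturbations standing in for unlearned models, and every name below (`run_toy`, `learn_adversarial_embedding`, `perturb`, and so on) are hypothetical assumptions of this sketch. An embedding is optimized against a surrogate (assumed here to be the original toy generator) and then reused, unchanged, on unseen perturbed copies.

```python
import random

def matvec(W, e):
    """Toy generator: a linear map from an embedding e to an 'image' vector."""
    return [sum(w * x for w, x in zip(row, e)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def perturb(W, scale, rng):
    """Hypothetical stand-in for an unlearned model: weights stay close to the original."""
    return [[w + rng.gauss(0.0, scale) for w in row] for row in W]

def learn_adversarial_embedding(W, target, steps=2000, lr=0.01):
    """Gradient descent on ||W e - target||^2, optimizing only the embedding e."""
    rows, d = len(W), len(W[0])
    e = [0.0] * d
    for _ in range(steps):
        r = [o - t for o, t in zip(matvec(W, e), target)]  # residual W e - target
        # Gradient of the squared error w.r.t. e is 2 * W^T r.
        grad = [2.0 * sum(W[i][j] * r[i] for i in range(rows)) for j in range(d)]
        e = [x - lr * g for x, g in zip(e, grad)]
    return e

def run_toy(seed=0, d=16, k=4, n_targets=3, noise=0.05):
    """Fit an embedding on the surrogate, then test it on unseen perturbed copies."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]  # original generator
    e_concept = [rng.gauss(0.0, 1.0) for _ in range(d)]  # concept's original embedding
    e_anchor = [rng.gauss(0.0, 1.0) for _ in range(d)]   # toy assumption: erased prompt maps here
    concept_image = matvec(W, e_concept)

    # Attack in embedding space on the surrogate (the original toy generator).
    e_adv = learn_adversarial_embedding(W, concept_image)
    surrogate_err = sq_dist(matvec(W, e_adv), concept_image)

    results = []
    for _ in range(n_targets):
        W_t = perturb(W, noise, rng)  # unseen "unlearned" model
        prompt_err = sq_dist(matvec(W_t, e_anchor), concept_image)  # redirected prompt
        adv_err = sq_dist(matvec(W_t, e_adv), concept_image)        # transferred embedding
        results.append((adv_err, prompt_err))
    return surrogate_err, results
```

`run_toy()` returns the surrogate fit error and, for each simulated unlearned model, a pair `(adv_err, prompt_err)`. In this toy the transferred embedding lands much closer to the concept output than the redirected prompt does, because every copy shares nearly the same generator. Many unlearning methods instead modify a nonlinear denoising network, so the sketch only conveys the intuition.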

Code & Models

Repositories

hxxdtd/pund · PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Diffusion