Rethinking the Vulnerability of Concept Erasure and a New Method

Alex D. Richardson; Kaicheng Zhang; Lucas Beerens; Dongdong Chen

arXiv:2502.17537·cs.LG·October 6, 2025

TL;DR

This paper analyzes why concept erasure methods for diffusion models remain vulnerable, shows that weaknesses are pervasive in the prompt embedding space of concept-erased models, and introduces RECORD, a coordinate-descent-based restoration algorithm that outperforms existing restoration methods by up to 17.8 times.

Contribution

It traces the vulnerability of concept-erased models to their prompt embedding space, a characteristic inherited from the original pre-unlearned model, and proposes RECORD, a novel coordinate-descent-based restoration algorithm.

Findings

01. Vulnerabilities are widespread in the prompt embedding spaces of erased models.

02. RECORD outperforms existing restoration methods by up to 17.8 times.

03. The proposed acceleration strategies improve the compute-performance tradeoff.

Abstract

The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. In response, concept erasure (defense) methods have been developed to "unlearn" specific concepts through post-hoc finetuning. However, recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts, revealing a critical vulnerability in current defense mechanisms. In this work, we first investigate the fundamental sources of adversarial vulnerability and reveal that vulnerabilities are pervasive in the prompt embedding space of concept-erased models, a characteristic inherited from the original pre-unlearned model. Furthermore, we introduce **RECORD**, a novel coordinate-descent-based restoration algorithm that…
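To make the phrase "coordinate-descent-based" concrete, the following is a minimal sketch of greedy coordinate descent over a discrete token sequence. It is not the authors' RECORD implementation. The function name `coordinate_descent`, the parameter `candidates_per_step`, and the toy objective `toy_loss` are all hypothetical, and the toy objective stands in for whatever restoration loss the paper uses. The sketch is also gradient-free, whereas Reviewer 03 notes that RECORD uses model gradients.

```python
import random
from typing import Callable, List, Sequence


def coordinate_descent(
    loss: Callable[[Sequence[int]], float],
    init_tokens: Sequence[int],
    vocab_size: int,
    sweeps: int = 3,
    candidates_per_step: int = 0,
    seed: int = 0,
) -> List[int]:
    """Greedy discrete coordinate descent: change one token position at a time."""
    rng = random.Random(seed)
    tokens = list(init_tokens)
    best = loss(tokens)
    for _ in range(sweeps):
        improved = False
        for pos in range(len(tokens)):
            # Candidates for this coordinate: the full vocabulary, or a random
            # subset when candidates_per_step > 0 (cheaper, but not exhaustive).
            if candidates_per_step > 0:
                k = min(candidates_per_step, vocab_size)
                cands = rng.sample(range(vocab_size), k)
            else:
                cands = range(vocab_size)
            for cand in cands:
                old = tokens[pos]
                if cand == old:
                    continue
                tokens[pos] = cand
                val = loss(tokens)
                if val < best:
                    best = val          # keep the improving token
                    improved = True
                else:
                    tokens[pos] = old   # revert a non-improving change
        if not improved:
            break  # a full sweep changed nothing: local minimum reached
    return tokens


# Hypothetical toy objective: number of positions that differ from a hidden
# target prompt. A real attack would score prompts with the erased model.
TARGET = [3, 1, 4, 1, 5]


def toy_loss(tokens: Sequence[int]) -> float:
    return float(sum(t != g for t, g in zip(tokens, TARGET)))


result = coordinate_descent(toy_loss, [0] * len(TARGET), vocab_size=8)
```

Because the toy loss is separable across positions, a single exhaustive sweep recovers the target. Realistic objectives couple the positions, which is why the loop allows several sweeps and stops only when a full sweep yields no improvement.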

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 5

Strengths

1. The paper is well written and easy for readers to follow.
2. The vulnerability of concept erasure is an important topic. I admire the authors' focus on this field.

Weaknesses

1. The method is somewhat easy. The experiments in Sec 3.1 and 3.2 are intuitive and persuasive, although previous studies have already shown this point. However, the proposed method is only borrowed from recent studies, such as the papers cited in Line 296. The authors did not introduce any novel method based on their empirical observation.
2. The evaluation is unfair and insufficient to some extent. The metric used is based on the denoising errors. However, the proposed attacking method is also base

Reviewer 02·Rating 4·Confidence 4

Strengths

1. The paper is well-written, clearly structured, and easy to follow, making both the motivation and technical contributions accessible to the reader.
2. The work provides an insightful reframing of the vulnerability in concept erasure, highlighting that restoration pathways are largely inherited from pretrained models rather than being caused by the erasure methods themselves.
3. The proposed discrete token-level coordinate descent attack is well-motivated and achieves strong performance, outpe

Weaknesses

1. The evaluation includes a limited number of erased concepts compared to the experimental settings used in prior baselines. Expanding the set of concepts would strengthen the generality of the conclusions.
2. It would be valuable to include comparisons with more text-encoder-based defense methods. While AdvUnlearn is included, methods such as SAFREE and others are omitted, and adding these would provide a more comprehensive assessment of defense robustness.

Reviewer 03·Rating 2·Confidence 4

Strengths

- (S1) **Insightful Analyses** in Section 3.2 and in most sections of the Appendix, especially in A, B and G.
- (S2) **Creative Visualisations** that help to illustrate the main arguments made in the paper, especially Figures 1, 2 and 4.
- (S3) **Experiments beyond SD v1.4** on the transferability to newer architectures of attack prompts derived from SD v1.4, and even qualitative results of applying RECORD directly to FLUX or SDXL.

Weaknesses

I find the following list of things to be major weaknesses:
- (W1) **Unclear claims** of superiority of (white-box) token-level attacks over embedding-level attacks (in lines 295-296). What does superior performance mean here? As far as I am informed, embedding-level attacks, such as CCE, are more effective than token-level ones.
- (W2) **Justification for white-box token-level attack**: RECORD requires access to model gradients and is therefore a white-box method. Attacks often employ a token/p

Code & Models

Repositories

lucasbeerens/record·Official

Taxonomy

Methods·Diffusion