TL;DR
This paper analyzes vulnerabilities in concept erasure methods for diffusion models, shows that weaknesses are pervasive in the prompt embedding space, and introduces RECORD, a restoration algorithm that outperforms existing methods.
Contribution
It traces the adversarial vulnerability of concept-erased models to the prompt embedding space, a characteristic inherited from the original model, and proposes RECORD, a coordinate-descent-based restoration algorithm.
Findings
Vulnerabilities are widespread in prompt embedding spaces of erased models.
RECORD outperforms existing restoration methods by up to 17.8 times.
The proposed acceleration strategies improve the compute-performance tradeoff.
Abstract
The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. In response, concept erasure (defense) methods have been developed to "unlearn" specific concepts through post-hoc finetuning. However, recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts, revealing a critical vulnerability in current defense mechanisms. In this work, we first investigate the fundamental sources of adversarial vulnerability and reveal that vulnerabilities are pervasive in the prompt embedding space of concept-erased models, a characteristic inherited from the original pre-unlearned model. Furthermore, we introduce **RECORD**, a novel coordinate-descent-based restoration algorithm that…
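To make the idea of discrete, token-level coordinate descent concrete, here is a minimal sketch on a toy objective. It is not the RECORD algorithm: the names `coordinate_descent` and `toy_loss`, the random candidate sampling, and the Hamming-distance loss are all assumptions chosen for illustration. The reviews note that RECORD itself is white-box and uses model gradients, which this sketch does not model.

```python
import random
from typing import Callable, List, Sequence, Tuple


def coordinate_descent(
    loss: Callable[[Sequence[int]], float],
    vocab_size: int,
    length: int,
    sweeps: int = 3,
    candidates_per_step: int = 8,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Greedy coordinate descent over a discrete token sequence.

    One position (coordinate) is updated at a time while the others stay
    fixed. Candidates are drawn at random here (an assumption); a white-box
    attack would typically rank candidates using gradient information.
    """
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(length)]
    best = loss(tokens)
    k = min(candidates_per_step, vocab_size)
    for _ in range(sweeps):
        for pos in range(length):
            for cand in rng.sample(range(vocab_size), k):
                trial = tokens[:pos] + [cand] + tokens[pos + 1:]
                value = loss(trial)
                # Accept a swap only if it strictly lowers the loss.
                if value < best:
                    tokens, best = trial, value
    return tokens, best


def toy_loss(tokens: Sequence[int]) -> float:
    """Hypothetical stand-in objective: mismatches against a hidden prompt."""
    target = [3, 1, 4, 1, 5]
    return float(sum(t != g for t, g in zip(tokens, target)))


# Trying every vocabulary entry at each position recovers the hidden prompt.
prompt, final_loss = coordinate_descent(
    toy_loss, vocab_size=10, length=5, candidates_per_step=10
)
```

In a real setting the loss would be evaluated on the concept-erased model, so each candidate evaluation is expensive. That cost is why the number of candidates per position matters for the compute-performance tradeoff.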
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper is well written and easy for readers to follow.
2. Vulnerability of concept erasure is an important topic. I admire the authors' focus on this field.
1. The method is somewhat easy. The experiments in Sec 3.1 and 3.2 are intuitive and persuasive, although previous studies have shown this point. However, the proposed method is only borrowed from recent studies, just as the cited papers in Line 296. The authors did not introduce any novel method based on their empirical observation.
2. The evaluation is unfair and insufficient to some extent. The used metric is based on the denoising errors. However, the proposed attacking method is also base
1. The paper is well-written, clearly structured, and easy to follow, making both the motivation and technical contributions accessible to the reader.
2. The work provides an insightful reframing of the vulnerability in concept erasure, highlighting that restoration pathways are largely inherited from pretrained models rather than being caused by the erasure methods themselves.
3. The proposed discrete token-level coordinate descent attack is well-motivated and achieves strong performance, outpe
1. The evaluation includes a limited number of erased concepts compared to the experimental settings used in prior baselines. Expanding the set of concepts would strengthen the generality of the conclusions.
2. It would be valuable to include comparisons with more text-encoder-based defense methods. While AdvUnlearn is included, methods such as SAFREE and others are omitted, and adding these would provide a more comprehensive assessment of defense robustness.
- (S1) **Insightful Analyses** in Section 3.2 and in most sections of the Appendix, especially in A, B and G.
- (S2) **Creative Visualisations** that help to illustrate the main arguments made in the paper, especially Figures 1, 2 and 4.
- (S3) **Experiments beyond SD v1.4** on the transferability to newer architectures of attack prompts derived from SD v1.4, or even qualitative results of applying RECORD directly to FLUX or SDXL.
I find the following list of things to be major weaknesses:
- (W1) **Unclear claims** of superiority of (white-box) token-level attacks over embedding-level attacks (in lines 295-296). What does superior performance mean here? As far as I am informed, embedding-level attacks, such as CCE, are more effective than token-level ones.
- (W2) **Justification for white-box token-level attack**: RECORD requires access to model gradients and is therefore a white-box method. Attacks often employ a token/p
Taxonomy
Methods·Diffusion
