Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

Uichan Lee; Jeonghyeon Kim; Sangheum Hwang

arXiv:2602.19631·cs.CV·February 24, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces HiRM, a novel method for precise concept erasure in text-to-image diffusion models by misdirecting high-level semantic representations, achieving effective removal with minimal impact on unrelated content.

Contribution

The paper proposes High-Level Representation Misdirection (HiRM), a new technique that improves concept erasure in diffusion models by targeting high-level text encoder representations while preserving image quality.

Findings

01. HiRM effectively removes target concepts across various categories.

02. It maintains high image quality and generative utility.

03. The method transfers to different architectures and enhances existing techniques.

Abstract

Text-to-image (T2I) diffusion models have advanced quickly and seen rapid, widespread adoption. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior works have primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). However, recent causal tracing studies suggest that visual attribute information is localized in the early self-attention layers of the text encoder, indicating a potential alternative route to concept erasure. Building on this insight, we conduct preliminary experiments and find that directly fine-tuning early layers can suppress target concepts but often degrades the generation quality of non-target concepts. To overcome this limitation, we propose High-Level…
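The general idea described on this page (update only an early layer of the text encoder while measuring the loss on the final-layer representation, as Reviewer 03 characterizes it) can be illustrated with a toy sketch. This is not the authors' implementation: the two-layer linear "encoder", the one-hot prompt embeddings, the fixed misdirection target, the squared-error loss, and every name below are assumptions chosen only to keep the example small and dependency-free.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def encode(W_early, W_late, x):
    """Toy two-layer linear 'text encoder': trainable early layer, frozen late layer."""
    return matvec(W_late, matvec(W_early, x))


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def misdirect_step(W_early, W_late, x, target, lr):
    """One gradient-descent step on W_early for L = ||W_late W_early x - target||^2."""
    resid = [o - t for o, t in zip(encode(W_early, W_late, x), target)]
    # Backpropagate through the frozen late layer: g = 2 * W_late^T @ resid
    g = [2.0 * sum(W_late[k][i] * resid[k] for k in range(len(resid)))
         for i in range(len(W_early))]
    # dL/dW_early[i][j] = g[i] * x[j]; W_late itself is never modified.
    return [[W_early[i][j] - lr * g[i] * x[j] for j in range(len(x))]
            for i in range(len(W_early))]


# Hypothetical setup (all values are illustrative assumptions).
W_early = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # trainable
W_late = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]]   # frozen
x_target = [1.0, 0.0, 0.0]      # stand-in embedding of the concept to erase
x_other = [0.0, 1.0, 0.0]       # stand-in embedding of an unrelated concept
misdirection = [0.0, 0.0, 1.0]  # assumed high-level target representation

initial_loss = sq_dist(encode(W_early, W_late, x_target), misdirection)
other_before = encode(W_early, W_late, x_other)

for _ in range(100):
    W_early = misdirect_step(W_early, W_late, x_target, misdirection, lr=0.1)

final_loss = sq_dist(encode(W_early, W_late, x_target), misdirection)
other_after = encode(W_early, W_late, x_other)

print("loss on target concept:", initial_loss, "->", final_loss)
print("unrelated output unchanged:", other_before == other_after)
```

In this toy the unrelated prompt is untouched only because the two inputs are orthogonal one-hot vectors, so the gradient never reaches the weights the unrelated prompt uses. A real text encoder offers no such guarantee, which is why preserving non-target concepts is something the paper has to evaluate empirically.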

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

• The paper is clearly written and easy to follow.
• The proposed method is simple and intuitively reasonable.
• The experiments appear comprehensive, and the results look promising.

Weaknesses

- In diffusion model training, the text encoder is typically a pre-trained model (e.g., CLIP text encoder) and remains frozen throughout the training process. This means that if unlearning is applied only to the text encoder, malicious users could easily replace the sanitized text encoder with the original one to recover all unlearned concepts. Therefore, in open-source settings, fine-tuning the core denoising model (i.e., the U-Net) makes more sense and is a more robust approach. This can be se

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Rather than training the diffusion model parameters, the authors fine-tune the text encoder parameters to improve efficiency.
2. The proposed method is straightforward and easy to implement.
3. The idea of using high-level semantic representations to guide updates in the early layers is interesting.

Weaknesses

1. Related work is missing. SPEED [A] leverages null-space constraints to achieve rapid concept erasure and can be extended to multi-concept scenarios. The authors should include SPEED as a baseline and compare efficiency.
2. Although the authors mention plans to extend the proposed method to multi-concept erasure, its tuning-based nature limits scalability. As more concepts are introduced, optimizing the early layers becomes increasingly difficult. Therefore, the authors should include a multi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* The writing is fluent and logically coherent, exhibiting strong readability.
* The proposed method is highly modular, requiring only a modification to the first layer of the text encoder to achieve concept erasure, which demonstrates strong practical applicability.
* The experimental design is thorough, and the results yield insights with meaningful implications for the research community.

Weaknesses

* The proposed method is relatively empirical and experimental, lacking solid theoretical support. It would be more beneficial to the community if the interpretability of the erased concept could be analyzed from the perspective of the distribution of activated neurons.
* The core idea of HiRM lies in computing the loss based on the output of the last layer of the text encoder, thereby enhancing its ability to erase high-level concepts. However, in Figure 2, the elimination of the high-level con

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning