AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

Fengpeng Li; Kemou Li; Qizhou Wang; Bo Han; Jiantao Zhou

arXiv:2602.06771·cs.LG·February 16, 2026

Open Access · 3 Reviews

TL;DR

AEGIS is a novel framework for concept erasure in diffusion models that enhances robustness against reactivation and preserves unrelated concepts without requiring retention data, addressing key challenges in safe model fine-tuning.

Contribution

The paper introduces AEGIS, a retention-data-free adversarial framework that simultaneously improves robustness and retention in concept erasure for diffusion models.

Findings

01. AEGIS outperforms existing methods in robustness against reactivation.

02. AEGIS maintains better retention of unrelated concepts.

03. Experimental results demonstrate superior safety and utility balance.

Abstract

Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a robustness-retention trade-off. Robustness means the model fine-tuned by concept erasure methods resists reactivation of erased concepts, even under semantically related prompts. Retention means unrelated concepts are preserved so the model's overall utility stays intact. Both are critical for concept erasure in practice, yet addressing them simultaneously is challenging. Prior work typically strengthens one while degrading the other, e.g., mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces Adversarial Erasure with Gradient Informed Synergy…
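To make the two quantities in the abstract concrete, the following is a minimal sketch of how robustness and retention could be scored from per-prompt outcomes. It is an illustration only, not the paper's evaluation protocol: the function names, the boolean inputs (whether a detector found the erased concept, whether an unrelated prompt was still rendered correctly), and the choice to report robustness as one minus the attack success rate are all assumptions made here.

```python
from typing import Dict, Sequence


def attack_success_rate(concept_detected: Sequence[bool]) -> float:
    """Fraction of adversarial prompts whose output still shows the erased concept.

    Hypothetical helper: each entry is the verdict of some concept detector
    on one image generated from one adversarial prompt.
    """
    if not concept_detected:
        raise ValueError("need at least one adversarial prompt result")
    return sum(concept_detected) / len(concept_detected)


def retention_rate(unrelated_correct: Sequence[bool]) -> float:
    """Fraction of unrelated prompts that the erased model still renders correctly."""
    if not unrelated_correct:
        raise ValueError("need at least one unrelated prompt result")
    return sum(unrelated_correct) / len(unrelated_correct)


def summarize(
    concept_detected: Sequence[bool],
    unrelated_correct: Sequence[bool],
) -> Dict[str, float]:
    """Report both sides of the trade-off; higher is better for each value."""
    asr = attack_success_rate(concept_detected)
    return {
        "robustness": 1.0 - asr,  # assumption: robustness = 1 - attack success rate
        "retention": retention_rate(unrelated_correct),
    }


if __name__ == "__main__":
    # Toy outcomes: one of four attacks reactivates the concept,
    # and three of four unrelated prompts are still rendered correctly.
    print(summarize([True, False, False, False], [True, True, True, False]))
```

Under this framing, a method that only improves one number while lowering the other exhibits the trade-off the abstract describes.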

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The problem of machine unlearning is an emerging and important area in the machine learning community.
- The paper focuses on an important sub-problem, robustness in unlearning, which is gaining increasing attention.
- The experimental setup appears comprehensive, and the results are promising.

Weaknesses

There are several concerns about the paper’s novelty. More specifically: • The first contribution—"the vulnerability of concept erasure stems from an inappropriately chosen learning target. In particular, if the target lies too close to the semantic center – formed by words semantically related to the erased concept – the concept information cannot be fully removed"—has already been studied in prior work [AGE, 1]. Specifically, AGE (Section 4) showed that the choice of the target concept signif

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The robustness of diffusion model unlearning is a highly important problem. The idea of adversarial erasure target (AET) is novel and well-motivated, and the authors provide detailed and solid explanations for their proposed methods. The robustness of AEGIS is validated on multiple attacks. The authors also compare AEGIS with multiple baselines. The figures and illustrations are of good quality.

Weaknesses

1. The paper lacks a comprehensive evaluation of retention performance. Currently, FID and the CLIP score are used. However, common DM unlearning benchmarks such as UnlearnCanvas [1] include evaluation metrics such as in-domain retain accuracy (IRA) and cross-domain retain accuracy (CRA). Since the authors claim AEGIS achieves a strong robustness–retention trade-off, a more comprehensive retention evaluation is needed.
2. The motivation of Parameter Regularization (PR) and Directional Gradient Rectific

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Instead of defending a single prompt, AEGIS dynamically optimizes a target prompt to approximate the semantic center of the concept being erased. With both AEGIS and GRP, it claims to achieve a better trade-off, and the claim is supported by experimental results.
2. The experiments are thorough: the method is validated across multiple concept types (object, style, nudity), model versions (SD v1.4, v2.1), and against a suite of strong adversarial attacks (P4D, UnlearnDiffAtk), proving its generalizability and robustn

Weaknesses

1. The method appears sensitive to hyperparameters such as w in the Section 5.3 ablation study. How should the best value be picked when unlearning a new concept?
2. How does the method scale if the model needs to unlearn many concepts or objects?

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis