Selective Fine-Tuning for Targeted and Robust Concept Unlearning

Mansi; Avinash Kori; Francesca Toni; Soteris Demetriou

arXiv:2602.07919·cs.AI·February 10, 2026

Open Access·3 Reviews

TL;DR

TRUST is a dynamic, Hessian-regularized selective fine-tuning method that effectively unlearns harmful concepts in diffusion models, improving robustness and efficiency over existing static approaches.

Contribution

The paper introduces TRUST, a novel dynamic approach for targeted concept unlearning that outperforms state-of-the-art methods in robustness, speed, and flexibility without additional regularization.

Findings

01·TRUST effectively unlearns individual and combined concepts.

02·TRUST is more robust against adversarial prompts.

03·TRUST is significantly faster than existing methods.

Abstract

Text-guided diffusion models are used by millions of users, but can be easily exploited to produce harmful content. Concept unlearning methods aim to reduce the models' likelihood of generating harmful content. Traditionally, this has been tackled at an individual concept level, with only a handful of recent works considering more realistic concept combinations. However, state-of-the-art methods depend on full finetuning, which is computationally expensive. Concept localisation methods can facilitate selective finetuning, but existing techniques are static, resulting in suboptimal utility. To tackle these challenges, we propose TRUST (Targeted Robust Selective fine Tuning), a novel approach for dynamically estimating target concept neurons and unlearning them through selective finetuning, empowered by a Hessian-based regularization. We show experimentally, against a number of…
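The contrast the abstract draws between static and dynamic concept localisation can be illustrated with a toy sketch. This is not the TRUST algorithm: the linear model, the squared-error "forget" loss, the top-k gradient-magnitude saliency rule, and every name below (`forget_grad`, `top_k_mask`, `selective_finetune`) are assumptions made for illustration, and the Hessian-based regularization is left out entirely. The only point is that a mask re-estimated at every step can follow saliency as it moves, while a mask fixed at the first step cannot.

```python
# Toy sketch only: NOT the TRUST method. All names and the loss are hypothetical,
# and the paper's Hessian-based regularization is not modeled here.

def forget_grad(weights, x, target):
    """Gradient of 0.5 * (w . x - target)^2 with respect to each weight."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    err = pred - target
    return [err * xi for xi in x]

def top_k_mask(grad, k):
    """Binary mask selecting the k entries with the largest |gradient|."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(grad))]

def selective_finetune(weights, batches, target, k, lr, dynamic=True):
    """Update only masked weights; re-estimate the mask each step if dynamic."""
    weights = list(weights)
    mask = None
    masks = []
    for x in batches:
        grad = forget_grad(weights, x, target)
        if dynamic or mask is None:
            mask = top_k_mask(grad, k)  # static mode keeps the first mask
        masks.append(mask)
        weights = [w - lr * g * m for w, g, m in zip(weights, grad, mask)]
    return weights, masks

# Two inputs whose most salient weight differs (index 0, then index 2).
w0 = [1.0, 1.0, 1.0, 1.0]
batches = [[1.0, 0.1, 0.0, 0.0], [0.0, 0.1, 1.0, 0.0]]

w_dyn, masks_dyn = selective_finetune(w0, batches, 0.0, k=1, lr=0.5, dynamic=True)
w_sta, masks_sta = selective_finetune(w0, batches, 0.0, k=1, lr=0.5, dynamic=False)

print("dynamic masks:", masks_dyn)
print("static masks: ", masks_sta)
```

In this toy setup the dynamic run moves its mask from weight 0 to weight 2 on the second step and updates it, whereas the static run keeps masking weight 0 and leaves weight 2 untouched. Whether saliency drifts this way in a real diffusion model is an empirical question that the paper, not this sketch, addresses.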

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

- Strong motivation to build upon the salient parameter shifts observed during the fine-tuning process.
- Strong demonstration of empirical improvement.
- Introduces conditional concept unlearning, which serves as a strong test of unlearning effectiveness at the sentence semantic level.

Weaknesses

- **Missing relevant work in discussions.** This work proposes a saliency-based method. While SalUn [1] is thoroughly discussed, other relevant saliency-based methods [2][3][4] are neither discussed nor compared. In particular, [4] uses a loss design on CLIP alignment for saliency parameters that is similar to that of TRUST (the proposed method).

[1] Fan et al., SalUn: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation. ICLR, 2024.
[2] Foster

Reviewer 02·Rating 6·Confidence 1

Strengths

I am not quite familiar with unlearning for diffusion models, and therefore cannot confidently assess the quality of this paper. I would recommend that the AC seek input from reviewers who are more familiar with this topic.

Weaknesses

- Line 53: there should be a comma before "leading to".
- Line 107: missing space in "TRUSTis".
- Line 194: "suppress" $\to$ "suppresses".

Reviewer 03·Rating 6·Confidence 3

Strengths

* The writing is fluent and logically coherent, exhibiting strong readability.
* Dynamic localization of concept neurons mitigates drift. TRUST re-estimates the mask each step, avoiding outdated static selections and directly addressing observed "saliency drift" during training.
* Complementary regularizers for hard/soft unlearning. CIP and CSR cover different deployment needs (compliance-oriented vs. fidelity-oriented) and are accompanied by clear mechanistic contrasts and visual analyses.
* St

Weaknesses

* Some ablation studies are needed to demonstrate the effectiveness of the method. For example, in the case of the CIP regularization, how does it compare to directly deactivating all concept neurons?
* It is necessary to show more diverse visual examples of concept unlearning, for example, removing specific stylistic concepts.
* For the conditional concept unlearning problem, is there any relationship between the distribution of activated neurons and that of single-concept unlearning? For examp

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Sentiment Analysis and Opinion Mining