Selective Fine-Tuning for Targeted and Robust Concept Unlearning
Mansi, Avinash Kori, Francesca Toni, Soteris Demetriou

TL;DR
TRUST is a dynamic, Hessian-regularized selective fine-tuning method that effectively unlearns harmful concepts in diffusion models, improving robustness and efficiency over existing static approaches.
Contribution
The paper introduces TRUST, a novel dynamic approach for targeted concept unlearning that outperforms state-of-the-art methods in robustness, speed, and flexibility without additional regularization.
Findings
TRUST effectively unlearns individual and combined concepts.
It is more robust against adversarial prompts.
TRUST is significantly faster than existing methods.
Abstract
Text-guided diffusion models are used by millions of users, but can be easily exploited to produce harmful content. Concept unlearning methods aim to reduce the models' likelihood of generating harmful content. Traditionally, this has been tackled at an individual concept level, with only a handful of recent works considering more realistic concept combinations. However, state-of-the-art methods depend on full finetuning, which is computationally expensive. Concept localisation methods can facilitate selective finetuning, but existing techniques are static, resulting in suboptimal utility. To tackle these challenges, we propose TRUST (Targeted Robust Selective fine Tuning), a novel approach for dynamically estimating target concept neurons and unlearning them through selective finetuning, empowered by a Hessian-based regularization. We show experimentally, against a number of…
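The contrast the abstract draws between static and dynamic concept localisation can be illustrated with a toy sketch. This is not the authors' implementation: the quadratic loss, the use of gradient magnitude as the saliency score, the top-k selection, and every function name below are assumptions made for illustration. The Hessian-based regularization is left out because its form is not given in the text above.

```python
# Toy illustration only: a hypothetical stand-in for selective fine-tuning,
# NOT the TRUST algorithm. Saliency = |gradient| is an assumption.

def top_k_mask(scores, k):
    """Return a 0/1 mask that selects the k largest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]


def toy_loss(params, targets):
    """Toy 'unlearning' objective: 0.5 * sum((p - t)^2)."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(params, targets))


def toy_grad(params, targets):
    """Gradient of toy_loss with respect to each parameter."""
    return [p - t for p, t in zip(params, targets)]


def selective_finetune(params, targets, k, lr, steps, dynamic=True):
    """Update only the masked parameters at each step.

    dynamic=True  re-estimates the mask from the current gradient every step.
    dynamic=False computes the mask once and reuses it (static selection).
    Returns the final parameters and the list of masks used per step.
    """
    params = list(params)
    mask = None
    history = []
    for _ in range(steps):
        grad = toy_grad(params, targets)
        if dynamic or mask is None:
            mask = top_k_mask([abs(g) for g in grad], k)
        history.append(mask)
        # Masked gradient step: unselected parameters stay frozen.
        params = [p - lr * m * g for p, m, g in zip(params, mask, grad)]
    return params, history


if __name__ == "__main__":
    start = [4.0, 3.0, 0.5, 0.1]
    zeros = [0.0] * len(start)
    for mode in (True, False):
        final, masks = selective_finetune(start, zeros, k=1, lr=0.5, steps=4, dynamic=mode)
        print("dynamic" if mode else "static", masks, toy_loss(final, zeros))
```

In this toy setting the dynamic variant switches which parameter it updates as the gradients change, whereas the static variant keeps updating the parameter chosen at the first step. The example says nothing about how TRUST behaves on a real diffusion model.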
Peer Reviews
Decision · Submitted to ICLR 2026
- Strong motivation to build upon the salient parameter shifts observed during the fine-tuning process.
- Strong demonstration of empirical improvement.
- Introduces conditional concept unlearning, which serves as a strong test of unlearning effectiveness at the sentence semantic level.
- **Missing relevant work in discussions.** This work proposes a saliency-based method. While SalUn [1] is thoroughly discussed, other relevant saliency-based methods [2][3][4] are neither discussed nor compared. In particular, [4] utilizes a loss design on CLIP alignment for saliency parameters that is similar to TRUST (the proposed method).
[1] Fan et al., SalUn: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation. ICLR, 2024
[2] Foster
I am not quite familiar with unlearning for diffusion models, and therefore cannot confidently assess the quality of this paper. I would recommend that the AC seek input from reviewers who are more familiar with this topic.
Line 53: there should be a comma before "leading to". Line 107: missing space in TRUSTis. Line 194: suppress $\to$ suppresses
* The writing is fluent and logically coherent, exhibiting strong readability.
* Dynamic localization of concept neurons mitigates drift. TRUST re-estimates the mask each step, avoiding outdated static selections and directly addressing observed “saliency drift” during training.
* Complementary regularizers for hard/soft unlearning. CIP and CSR cover different deployment needs (compliance-oriented vs. fidelity-oriented) and are accompanied by clear mechanistic contrasts and visual analyses.
* St
* Some ablation studies are needed to demonstrate the effectiveness of the method. For example, in the case of the CIP regularization, how does it compare to directly deactivating all concept neurons?
* It is necessary to show more diverse visual examples of concept unlearning, for example, removing specific stylistic concepts.
* For the conditional concept unlearning problem, is there any relationship between the distribution of activated neurons and that of single-concept unlearning? For examp
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Sentiment Analysis and Opinion Mining
