GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention
Amro Abdalla, Ismail Shaheen, Dan DeGenaro, Rupayan Mallick, Bogdan Raita, Sarah Adel Bargal

TL;DR
GIFT is a gradient-aware immunization method that defends diffusion models against malicious fine-tuning by degrading harmful concept learning while preserving safe content generation, using a bi-level optimization framework.
Contribution
The paper introduces GIFT, a novel bi-level optimization approach that enhances diffusion model safety against adversarial fine-tuning without sacrificing safe content quality.
Findings
Significantly impairs learning of harmful concepts in models
Maintains high performance on safe content
Offers robust resistance to adversarial fine-tuning
Abstract
We present GIFT: a Gradient-aware Immunization technique to defend diffusion models against malicious Fine-Tuning while preserving their ability to generate safe content. Existing safety mechanisms like safety checkers are easily bypassed, and concept erasure methods fail under adversarial fine-tuning. GIFT addresses this by framing immunization as a bi-level optimization problem: the upper-level objective degrades the model's ability to represent harmful concepts using representation noising and maximization, while the lower-level objective preserves performance on safe data. GIFT achieves robust resistance to malicious fine-tuning while maintaining safe generative quality. Experimental results show that our method significantly impairs the model's ability to re-learn harmful concepts while maintaining performance on safe content, offering a promising direction for creating…
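The degrade-and-preserve structure described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: it replaces the diffusion model with a two-weight linear regressor, replaces the bi-level problem with simple alternating gradient steps, and omits representation noising entirely. The names `mse_and_grad`, `immunize`, `safe_data`, `harmful_data`, and all learning rates are hypothetical choices made for illustration.

```python
# Toy illustration only: a 2-weight linear model, NOT a diffusion model,
# and an alternating-update stand-in for a true bi-level solver.
# Every name and constant here is an assumption, not taken from the paper.

def mse_and_grad(w, data):
    """Return the mean squared error of y ~ w.x and its gradient w.r.t. w."""
    n = len(data)
    loss = 0.0
    grad = [0.0, 0.0]
    for x, y in data:
        err = w[0] * x[0] + w[1] * x[1] - y
        loss += err * err / n
        grad[0] += 2.0 * err * x[0] / n
        grad[1] += 2.0 * err * x[1] / n
    return loss, grad

# "Safe" targets depend mostly on the first weight,
# "harmful" targets mostly on the second.
safe_data = [((1.0, 0.0), 2.0), ((2.0, 0.1), 4.0)]
harmful_data = [((0.0, 1.0), 3.0), ((0.1, 2.0), 6.0)]

def immunize(w, steps=50, lr_safe=0.05, lr_harm=0.01):
    """Alternate a degrade step and a preserve step for a fixed budget."""
    w = list(w)
    for _ in range(steps):
        # Degrade: gradient ASCENT on the harmful loss.
        _, g_harm = mse_and_grad(w, harmful_data)
        w = [wi + lr_harm * gi for wi, gi in zip(w, g_harm)]
        # Preserve: gradient DESCENT on the safe loss.
        _, g_safe = mse_and_grad(w, safe_data)
        w = [wi - lr_safe * gi for wi, gi in zip(w, g_safe)]
    return w

if __name__ == "__main__":
    w0 = [2.0, 3.0]  # stand-in for "pre-trained" weights
    w1 = immunize(w0)
    for name, w in (("before", w0), ("after", w1)):
        harm, _ = mse_and_grad(w, harmful_data)
        safe, _ = mse_and_grad(w, safe_data)
        print(f"{name}: harmful loss={harm:.4f}, safe loss={safe:.4f}")
```

In this toy setting the harmful loss grows while the safe loss stays small, because the ascent step mostly moves the weight that the safe data barely uses. Whether an immunized model then resists re-learning under fine-tuning, which is the property the paper evaluates, is not something this sketch demonstrates.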
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- GIFT operates in post-training and can be applied to pre-trained models, thus not harming the pre-training process.
- The paper is clearly written and easy to follow.
- Enforcing the model to erase malicious concepts from all layers is interesting and the proposed approach is intuitive.
- **Immunization effectiveness.** Both figures 4 and 6 show that GIFT can still generate the immunized concepts in many of its outputs. This raises a concern about a simple attack: sampling multiple generations until the malicious concept appears. To address this, could the authors provide a larger set of uncurated generations (e.g., 25 samples) from a few randomly selected immunized concepts to better characterize the actual failure rate?
- **Limited concept coverage.** The paper demonstrates
1. The paper explores a valuable problem of making diffusion models resistant to malicious fine-tuning.
2. The authors structure the paper clearly.
The quality of the paper still needs improvement.
1. The experiments were conducted only on SD 1.5, with no results on more recent diffusion models such as SD3, SD3.5, or Flux.
2. Regarding Figure 2, it appears that the CLIP Score of the SD model becomes lower than that of both GIFT and IMMA after fine-tuning for 1000 steps. Does this indicate that the evaluation prompts overlap too much with the training set, so that normal generation capability is maintained only on the evaluation prompts?
Here are the paper's strengths:
- It is well-written and well-documented
- The review of the state-of-the-art covers most of the relevant papers in the field
- The proposed approach is scientifically sound
- Provides comprehensive experiments across multiple domains (objects, art styles, NSFW content) with both qualitative and quantitative evaluation
- Experimental validation is complemented with ablation studies, prompt robustness tests, and analysis of hyperparameter effects (e.g., $\beta$)
Here are the paper's weaknesses:
- The proposed approach relies heavily on the existing approach IMMA (Zheng & Yeh, 2024).
- Some aspects of the paper should be explained more clearly to improve readability, especially in section 3: the formalism is not sufficiently detailed
- Presentation is dense and sometimes awkwardly formulated (“of said parameters” - lines 150-151, “LLM immunization technique” - line 200)
- The connection between theoretical motivation (MI reduction) and its practic
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Advanced Malware Detection Techniques
