How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models
Feng He, Zhenyang Liu, Marco Valentino, Zhixue Zhao

TL;DR
This empirical study investigates whether edits applied to text-to-image (T2I) diffusion models survive subsequent fine-tuning. It finds that most edits tend to be reversed, with implications for AI safety and robustness.
Contribution
The paper systematically analyzes the interaction between model editing and fine-tuning in T2I diffusion models, highlighting the limitations of current editing techniques in maintaining edits after fine-tuning.
Findings
Edits generally do not persist after fine-tuning.
Fine-tuning with DoRA reverses edits most strongly.
Edits made with UCE are comparatively more robust to fine-tuning.
Abstract
Model editing offers a low-cost technique to inject or correct a particular behavior in a pre-trained model without extensive retraining, supporting applications such as factual correction and bias mitigation. Despite this common practice, it remains unknown whether edits persist after fine-tuning or whether they are inadvertently reversed. This question has fundamental practical implications. For example, if fine-tuning removes prior edits, it could serve as a defence mechanism against hidden malicious edits. Vice versa, the unintended removal of edits related to bias mitigation could pose serious safety concerns. We systematically investigate the interaction between model editing and fine-tuning in the context of T2I diffusion models, which are known to exhibit biases and generate inappropriate content. Our study spans two T2I model families (Stable Diffusion and FLUX), two sota…
