Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing
Aly M. Kassem, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi

TL;DR
This paper introduces MNEME, a lightweight sparse diffing framework that detects unintended side effects in large language models after fine-tuning or unlearning, without access to original training data.
Contribution
MNEME provides a novel, scalable method for identifying behavioral shifts in LLMs caused by fine-tuning or unlearning, using sparse model diffing without requiring access to training data.
Findings
Achieves up to 95% accuracy in predicting side effects
Effective across multiple LLMs and scenarios
Enables partial reversal of effects through targeted retraining
Abstract
Large language models (LLMs) are frequently fine-tuned or unlearned to adapt to new tasks or eliminate undesirable behaviors. While existing evaluation methods assess performance after such interventions, there remains no general approach for detecting unintended side effects, such as unlearning biology content degrading performance on chemistry tasks, particularly when these effects are unpredictable or emergent. To address this issue, we introduce MNEME, Model diffiNg for Evaluating Mechanistic Effects, a lightweight framework for identifying these side effects using sparse model diffing. MNEME isolates behavioral shifts by comparing base and fine-tuned models on task-agnostic data (for example, The Pile, LMSYS-Chat-1M), without access to the fine-tuning data. Applied to five LLMs across three scenarios (WMDP knowledge unlearning, emergent misalignment, and benign fine-tuning), MNEME achieves…
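The abstract does not spell out MNEME's diffing procedure, so the following is only a toy illustration of the general idea behind model diffing on task-agnostic inputs, not the authors' method. It assumes that activations from both models on the same prompts can be encoded against a shared dictionary of feature directions, and it ranks features by how much their mean activation changes. The function name `top_shifted_latents`, the ReLU encoder, the orthonormal dictionary, and the synthetic activations are all hypothetical choices made for this sketch.

```python
import numpy as np

def top_shifted_latents(base_acts, tuned_acts, dictionary, k=3):
    """Rank dictionary latents by the change in their mean activation.

    Hypothetical stand-in for a learned sparse encoder: each latent is a
    ReLU of the projection onto one dictionary direction.
    """
    base_codes = np.maximum(base_acts @ dictionary.T, 0.0)
    tuned_codes = np.maximum(tuned_acts @ dictionary.T, 0.0)
    shift = tuned_codes.mean(axis=0) - base_codes.mean(axis=0)
    order = np.argsort(-np.abs(shift))[:k]  # largest absolute shift first
    return order, shift[order]

rng = np.random.default_rng(0)
n_prompts, d_model = 200, 16

# Toy dictionary with orthonormal rows, so the latents do not interfere.
# (Real sparse dictionaries are usually overcomplete; this is a simplification.)
dictionary, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))

# Synthetic "base model" activations on task-agnostic prompts.
base_acts = rng.normal(size=(n_prompts, d_model))

# Simulated intervention: the "fine-tuned model" suppresses direction 7.
tuned_acts = base_acts - 2.0 * dictionary[7]

top_idx, top_shift = top_shifted_latents(base_acts, tuned_acts, dictionary)
print(top_idx, top_shift)
```

In this synthetic setup the suppressed direction is ranked first with a negative shift. In a real setting, the highest-ranked features would still need to be interpreted before they could be tied to a concrete side effect.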
