Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing

Aly M. Kassem; Zhuan Shi; Negar Rostamzadeh; Golnoosh Farnadi

arXiv:2507.21084 · cs.CL · July 30, 2025

TL;DR

This paper introduces MNEME, a lightweight sparse model diffing framework that detects unintended side effects in large language models after fine-tuning or unlearning, without access to the fine-tuning data.

Contribution

MNEME identifies behavioral shifts in LLMs caused by fine-tuning or unlearning by comparing the base and fine-tuned models on task-agnostic data with sparse model diffing, so it needs no access to the fine-tuning data.

Findings

01. Achieves up to 95% accuracy in predicting side effects

02. Evaluated on five LLMs across three scenarios: WMDP knowledge unlearning, emergent misalignment, and benign fine-tuning

03. Enables partial reversal of side effects through targeted retraining

Abstract

Large language models (LLMs) are frequently fine-tuned or unlearned to adapt to new tasks or eliminate undesirable behaviors. While existing evaluation methods assess performance after such interventions, there remains no general approach for detecting unintended side effects, such as unlearning biology content degrading performance on chemistry tasks, particularly when these effects are unpredictable or emergent. To address this issue, we introduce MNEME, Model diffiNg for Evaluating Mechanistic Effects, a lightweight framework for identifying these side effects using sparse model diffing. MNEME compares base and fine-tuned models on task-agnostic data (for example, The Pile, LMSYS-Chat-1M) without access to fine-tuning data to isolate behavioral shifts. Applied to five LLMs across three scenarios (WMDP knowledge unlearning, emergent misalignment, and benign fine-tuning), MNEME achieves…
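
The abstract does not spell out the diffing procedure, so the following is only a toy illustration of the general idea of comparing a base and a fine-tuned model on the same task-agnostic samples. It assumes that sparse feature activations have already been extracted for both models by some dictionary-learning step that is not shown. The function names, the activation values, and the top-k selection rule are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def feature_shifts(base_acts, tuned_acts):
    """Mean per-feature change in activation (tuned minus base).

    Both arguments are lists of per-sample activation vectors with the
    same number of samples and features.
    """
    n_features = len(base_acts[0])
    return [
        mean(t[j] - b[j] for b, t in zip(base_acts, tuned_acts))
        for j in range(n_features)
    ]


def top_shifted_features(shifts, k):
    """Indices of the k features with the largest absolute mean shift."""
    order = sorted(range(len(shifts)), key=lambda j: abs(shifts[j]), reverse=True)
    return order[:k]


def rank_samples(base_acts, tuned_acts, features):
    """Sample indices, most affected first, by total absolute change on `features`."""
    def score(i):
        return sum(abs(tuned_acts[i][j] - base_acts[i][j]) for j in features)
    return sorted(range(len(base_acts)), key=score, reverse=True)


# Hypothetical activations: 4 task-agnostic samples x 3 sparse features.
base = [
    [0.0, 1.0, 0.5],
    [0.0, 0.9, 0.4],
    [0.1, 1.1, 0.5],
    [0.0, 1.0, 0.6],
]
tuned = [
    [0.0, 1.0, 0.5],  # unchanged
    [0.0, 0.1, 0.4],  # feature 1 strongly suppressed
    [0.1, 1.1, 0.5],  # unchanged
    [0.0, 0.5, 0.6],  # feature 1 partly suppressed
]

shifts = feature_shifts(base, tuned)
changed = top_shifted_features(shifts, k=1)
affected = rank_samples(base, tuned, changed)
print("most shifted features:", changed)
print("most affected samples:", affected[:2])
```

In this toy setup, the samples that load most heavily on the shifted features are the natural candidates to inspect for side effects. How MNEME itself selects features and turns them into predictions is described in the full paper, not here.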
