REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs
Liran Cohen, Yaniv Nemcovesky, Avi Mendelson

TL;DR
REMIND is an evaluation method that detects residual memorization in language models after unlearning by analyzing how the loss changes over small input variations; the authors report that it outperforms existing evaluation methods.
Contribution
The paper introduces REMIND, a novel, query-based evaluation technique that reveals subtle residual influence of unlearned data through loss landscape analysis, outperforming existing methods.
Findings
Unlearned data produce flatter loss landscapes.
REMIND outperforms existing evaluation methods.
Robust across models, datasets, and paraphrased inputs.
Abstract
Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance. Therefore, verifying whether a model has truly forgotten target data is essential for maintaining reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs. This approach may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method aiming to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model's loss over small input variations and reveals…
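To make the idea of analyzing the loss over small input variations concrete, here is a minimal Python sketch. It is not the authors' implementation: the truncated abstract does not specify the perturbation scheme or the features REMIND uses, so the single-token substitution, the function names (`neighborhood`, `landscape_features`), and the two summary statistics are all assumptions chosen for illustration. The `loss_fn` argument stands in for a call that returns a model's loss on a token sequence.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence


def neighborhood(tokens: Sequence[str], vocab: Sequence[str],
                 n_neighbors: int, rng: random.Random) -> List[List[str]]:
    """Return variants of `tokens` that each differ in exactly one position.

    Assumption: single-token substitution is used as the "small input
    variation"; the paper may use a different perturbation.
    Requires a non-empty `tokens` and at least two distinct vocab entries.
    """
    variants = []
    for _ in range(n_neighbors):
        pos = rng.randrange(len(tokens))
        choices = [t for t in vocab if t != tokens[pos]]
        variant = list(tokens)
        variant[pos] = rng.choice(choices)
        variants.append(variant)
    return variants


def landscape_features(tokens: Sequence[str],
                       loss_fn: Callable[[Sequence[str]], float],
                       vocab: Sequence[str],
                       n_neighbors: int = 32,
                       seed: int = 0) -> Dict[str, float]:
    """Summarize how the loss behaves around `tokens` (hypothetical features)."""
    rng = random.Random(seed)
    center = loss_fn(tokens)
    losses = [loss_fn(v) for v in neighborhood(tokens, vocab, n_neighbors, rng)]
    deltas = [abs(loss - center) for loss in losses]
    return {
        "center_loss": center,
        # Small values mean the loss barely moves under perturbation (flat).
        "mean_abs_delta": statistics.fmean(deltas),
        # Spread of the loss across the sampled neighbors.
        "loss_std": statistics.pstdev(losses),
    }


# Toy stand-ins for a model loss, used only to exercise the functions.
VOCAB = ["a", "b", "c", "d"]
SEQ = ["a", "b", "c"]


def flat_loss(tokens: Sequence[str]) -> float:
    """Loss that ignores the input entirely: a perfectly flat landscape."""
    return 1.0


def sharp_loss(tokens: Sequence[str]) -> float:
    """Loss that counts positions differing from SEQ: rises on any edit."""
    return float(sum(x != y for x, y in zip(tokens, SEQ)))


flat = landscape_features(SEQ, flat_loss, VOCAB)
sharp = landscape_features(SEQ, sharp_loss, VOCAB)
```

In a real evaluation, features of this kind would be computed for each target example and passed to a classifier that decides whether the example looks forgotten. How REMIND does that step is not described in the excerpt above.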
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Privacy-Preserving Technologies in Data
