The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation
Hengrui Jia, Taoran Li, Jonas Guan, Varun Chandrasekaran

TL;DR
This paper critically examines the effectiveness of current unlearning metrics for LLMs, revealing they often overestimate success and proposing a new stress-testing framework to better evaluate true model forgetting.
Contribution
The paper introduces Proximal Surrogate Generation (PSG), a novel automated stress-testing method that challenges existing unlearning metrics by revealing their limitations in detecting retained knowledge.
Findings
Current metrics often overestimate unlearning success.
Models retain semantic knowledge despite passing standard tests.
Stress tests expose significant gaps in unlearning evaluation methods.
Abstract
Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset. We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning, motivated by copyright or safety, implicitly target not only verbatim content in the unlearning dataset, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have "forgotten" the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to the unlearning dataset. This phenomenon indicates that erasing exact…
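One way to make the gap described in the abstract concrete is to compare how much a model's score drops on the unlearning dataset with how much it drops on semantically adjacent content. The sketch below is a minimal illustration under stated assumptions, not the paper's PSG method: the function name `forgetting_gap`, the score lists, and the 0.5 threshold are all hypothetical, and the scores are assumed to be per-example values in [0, 1] from whatever metric is in use.

```python
from statistics import mean


def relative_drop(before, after):
    """Fraction of the original mean score lost after unlearning."""
    base = mean(before)
    if base == 0:
        raise ValueError("mean score before unlearning must be non-zero")
    return (base - mean(after)) / base


def forgetting_gap(forget_before, forget_after,
                   adjacent_before, adjacent_after,
                   threshold=0.5):
    """Compare score drops on the unlearning set and on adjacent content.

    All names and the threshold are illustrative assumptions.
    """
    forget_drop = relative_drop(forget_before, forget_after)
    adjacent_drop = relative_drop(adjacent_before, adjacent_after)
    return {
        "forget_drop": forget_drop,
        "adjacent_drop": adjacent_drop,
        # Large drop on the unlearning set but small drop on adjacent
        # content: the standard metric passes while knowledge remains.
        "possibly_illusory": forget_drop >= threshold
                             and adjacent_drop < threshold,
    }


# Made-up scores for illustration only; not results from the paper.
report = forgetting_gap(
    forget_before=[0.9, 0.8, 1.0],
    forget_after=[0.1, 0.0, 0.2],
    adjacent_before=[0.8, 0.9, 0.7],
    adjacent_after=[0.7, 0.8, 0.6],
)
print(report)
```

A check of this kind only reports a discrepancy between the two drops. How the semantically adjacent content is built is what the paper's stress test addresses, and that is not reproduced here.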
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Adversarial Robustness in Machine Learning
