The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation

Hengrui Jia; Taoran Li; Jonas Guan; Varun Chandrasekaran

arXiv:2512.19025 · cs.CR · December 24, 2025

TL;DR

This paper examines how well current unlearning metrics for LLMs measure forgetting, shows that they often overestimate success, and proposes a stress-testing framework that better evaluates whether a model has truly forgotten.

Contribution

The paper introduces Proximal Surrogate Generation (PSG), a novel automated stress-testing method that challenges existing unlearning metrics by revealing their limitations in detecting retained knowledge.

Findings

01. Current metrics often overestimate unlearning success.

02. Models retain semantic knowledge despite passing standard tests.

03. Stress tests expose significant gaps in unlearning evaluation methods.

Abstract

Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_{u}$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning, motivated by copyright or safety, implicitly target not only verbatim content in $D_{u}$, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have "forgotten" the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_{u}$. This phenomenon indicates that erasing exact…
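To make the gap described in the abstract concrete, the following is a minimal sketch of comparing a model's score on the unlearning set $D_{u}$ with its score on semantically adjacent prompts. It is not the paper's PSG method: the function names (`accuracy`, `forgetting_gap`), the exact-match scoring, and the toy model and data are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

QAPairs = List[Tuple[str, str]]


def accuracy(answer_fn: Callable[[str], str], qa_pairs: QAPairs) -> float:
    """Fraction of questions answered with an exact (case-insensitive) match."""
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    correct = sum(
        1 for question, answer in qa_pairs
        if answer_fn(question).strip().lower() == answer.strip().lower()
    )
    return correct / len(qa_pairs)


def forgetting_gap(
    answer_fn: Callable[[str], str],
    forget_set: QAPairs,
    adjacent_set: QAPairs,
) -> Dict[str, float]:
    """Compare accuracy on D_u with accuracy on semantically adjacent prompts.

    A large positive gap means the model looks "forgotten" on D_u while
    still answering nearby questions correctly.
    """
    forget_acc = accuracy(answer_fn, forget_set)
    adjacent_acc = accuracy(answer_fn, adjacent_set)
    return {
        "forget_acc": forget_acc,
        "adjacent_acc": adjacent_acc,
        "gap": adjacent_acc - forget_acc,
    }


# Hypothetical toy "unlearned" model: it refuses the exact D_u prompt
# but still answers paraphrases of the same fact.
def toy_unlearned_model(question: str) -> str:
    blocked = {"Who wrote Book X?"}
    knowledge = {
        "Who wrote Book X?": "Author A",
        "Book X was written by whom?": "Author A",
        "Name the author of Book X.": "Author A",
    }
    if question in blocked:
        return "I don't know"
    return knowledge.get(question, "I don't know")


forget_set = [("Who wrote Book X?", "Author A")]
adjacent_set = [
    ("Book X was written by whom?", "Author A"),
    ("Name the author of Book X.", "Author A"),
]

report = forgetting_gap(toy_unlearned_model, forget_set, adjacent_set)
print(report)
```

In this toy setup the model scores zero on $D_{u}$, which a metric that only looks at $D_{u}$ would read as successful unlearning, yet it answers every adjacent prompt correctly. A real evaluation would replace exact-match scoring and the hand-written paraphrases with whatever metric and surrogate data the evaluator actually uses.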

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Adversarial Robustness in Machine Learning