Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh; Jiajun Ruan; Yiwei Chen; Soumyadeep Pal; Sijia Liu; Mingyi Hong

arXiv:2511.04934·cs.LG·February 24, 2026

TL;DR

This paper reveals that current unlearning methods for large language models often fail to fully forget sensitive information under realistic sampling, highlighting the need for more robust unlearning techniques.

Contribution

The authors introduce the leak@$k$ metric to evaluate unlearning reliability and propose the RULE algorithm to improve forgetting in LLMs.

Findings

1. Existing unlearning methods show persistent knowledge leakage under probabilistic decoding.

2. The leak@$k$ metric effectively quantifies the likelihood of sensitive information reappearing.

3. RULE reduces information leakage significantly in the TOFU benchmark.

Abstract

Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these 'unlearned' models under deterministic (greedy) decoding often suggest successful knowledge removal using standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@$k$, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating $k$ samples from the model under realistic decoding…
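The abstract describes leak@$k$ only at a high level, and its formal definition is not reproduced on this page. As a minimal sketch of the general idea, and not the authors' definition, the snippet below assumes each sampled generation receives a binary leak / no-leak judgment and borrows the combinatorial estimator commonly used for pass@k-style metrics. The function name `leak_at_k_estimate` and the counts in the usage loop are hypothetical.

```python
from math import comb


def leak_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled generations leaks.

    Assumptions (not from the paper): each generation is judged leak / no leak,
    n generations were drawn for a prompt, and c of them were judged leaks.
    Drawing k of the n without replacement, the chance that none leaks is
    C(n - c, k) / C(n, k), so the estimate is one minus that ratio.
    """
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("need n > 0 and 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 200 sampled generations for one prompt, 3 judged leaks.
    for k in (1, 8, 64):
        print(k, round(leak_at_k_estimate(n=200, c=3, k=k), 4))
```

Under these assumptions the estimate is non-decreasing in $k$: a per-sample leak rate that looks negligible for a single draw can become substantial once many samples are drawn, which is the kind of gap between greedy and probabilistic evaluation that the abstract points to.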

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper is well-written; its logical structure and clear organization make the core arguments easy to follow and understand. Its primary and most significant contribution lies in the clear and timely identification of a critical vulnerability in current LLM unlearning evaluation. The authors rightly argue that the field's widespread reliance on deterministic (greedy) decoding fosters a misleading "illusion of forgetting," which obscures substantial risks present in real-world probabilistic

Weaknesses

While I strongly agree with the paper's major contribution, the following weaknesses make it difficult to evaluate the work more highly:

1. While the paper is generally well-structured and the prose is clear, the inconsistent use of citation commands (e.g., \cite vs. \citep) detracts from its professional polish. A thorough revision to ensure appropriate and consistent citation formatting throughout the manuscript is necessary to improve overall readability.

2. A significant weakness is the om

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- [S1] **Interesting and relevant direction.** The paper addresses a timely and important issue in LLM research, the reliability of unlearning, and provides a new perspective by re-examining prior work under probabilistic decoding. This direction is valuable given the growing importance of safe and compliant model deployment.
- [S2] **Comprehensive experimental coverage.** The authors perform extensive empirical studies across multiple established benchmarks (TOFU, MUSE, and WMDP), offering a fa

Weaknesses

- [W1] **Misalignment with the goal of unlearning.** The paper’s overall analysis appears misaligned with the conventional goal of unlearning, which is to make the unlearned model approximate the retain model, rather than simply avoid producing the correct answer. Accordingly, the Retrain curve in Figure 1 should be interpreted as a performance upper bound that effective unlearning methods should aim to approach, yet this interpretation is never discussed. In several cases (e.g., Tables 1, 7, an

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This paper has the following strengths:
- This paper is clearly written and easy to read.
- This paper identifies a practical and important limitation in evaluating large language model unlearning.
- The proposed metric demonstrates effectiveness in facilitating LLM unlearning.

Weaknesses

This paper has the following weaknesses:
- The evaluated models and methods are limited.
- There is a lack of insight into why these unlearning methods fail to perform effectively outside the greedy setting.

Code & Models

Models

🤗 Jiajunruan/NPO-Fix · model · 1 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)