BLUR: A Benchmark for LLM Unlearning Robust to Forget-Retain Overlap

Shengyuan Hu, Neil Kale, Pratiksha Thaker, Yiwei Fu, Steven Wu, and Virginia Smith

arXiv:2506.15699·cs.LG·June 23, 2025

Open Access·3 Reviews

TL;DR

BLUR is a new benchmark for evaluating large language model unlearning that realistically assesses forget-retain overlap and reveals limitations of current methods, emphasizing the need for more robust unlearning techniques.

Contribution

We introduce BLUR, a comprehensive benchmark that improves evaluation of LLM unlearning by addressing forget-retain overlap and including diverse, realistic scenarios.

Findings

01·Existing unlearning methods perform poorly on BLUR.

02·Simple approaches outperform recent complex methods on BLUR.

03·Evaluation reveals significant performance drops, highlighting robustness issues.

Abstract

Machine unlearning has the potential to improve the safety of large language models (LLMs) by removing sensitive or harmful information post hoc. A key challenge in unlearning involves balancing between forget quality (effectively unlearning undesirable information) and retain quality (maintaining good performance on other, general tasks). Unfortunately, as we show, current LLM unlearning benchmarks contain highly disparate forget and retain sets, painting a false picture of the effectiveness of LLM unlearning methods. This can be particularly problematic because it opens the door for benign perturbations, such as relearning attacks, to easily reveal supposedly unlearned knowledge once models are deployed. To address this, we present BLUR: a benchmark for LLM unlearning that provides more realistic scenarios of forget-retain overlap. BLUR significantly expands on…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

+ The paper studies an important and practical robustness problem in LLM unlearning.
+ BLUR introduces both combined-query and relearning evaluation settings, capturing diverse forms of forget-retain overlap (keyword insertion, concatenation, and semantic) and multiple levels of relearning difficulty.
+ BLUR is applied to well-known unlearning datasets (TOFU, WMDP, WHP, RWKU).
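As an illustration of the keyword-insertion and concatenation forms of overlap named above, here is a minimal sketch. It is not BLUR's actual construction pipeline: the function names, the `[ENTITY]` placeholder, and the example queries are all hypothetical and chosen only to show the idea of mixing forget-set and retain-set content in one query.

```python
def concatenate_queries(forget_query: str, retain_query: str) -> str:
    """Join a forget-set query and a retain-set query into one combined prompt."""
    return f"{forget_query.strip()} {retain_query.strip()}"


def insert_keyword(retain_query: str, placeholder: str, forget_keyword: str) -> str:
    """Substitute a forget-set keyword into a retain-set query template."""
    if placeholder not in retain_query:
        raise ValueError(f"placeholder {placeholder!r} not found in query")
    return retain_query.replace(placeholder, forget_keyword)


# Hypothetical examples; neither query comes from the benchmark itself.
combined = concatenate_queries(
    "Where was the author Jane Doe born?",  # assumed forget-set query
    "What is the capital of France?",       # assumed retain-set query
)
overlapped = insert_keyword(
    "Which city hosted a reading by [ENTITY]?",  # assumed retain-set template
    "[ENTITY]",
    "Jane Doe",                                  # assumed forget-set keyword
)
```

Under this reading, an unlearned model would ideally still answer the retain portion of `combined` while declining or failing on the forget portion; how the benchmark actually scores such cases is not stated in this excerpt.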

Weaknesses

## Major Issues

+ Prior work [1] has already explored forget-token insertion into retain queries; this paper's extensions, e.g., multiple keyword replacement, combined queries, and relearning with varying relevance, build directly on this without introducing new methods or insights, making its originality limited. The proposed BLUR primarily extends existing datasets (TOFU, WMDP, WHP, RWKU) rather than introducing a new data generation pipeline or novel evaluation. As a result, the contribution

Reviewer 02·Rating 2·Confidence 4

Strengths

**Originality:** The paper significantly expands the concept of "forget-retain overlap" by defining three distinct forms of it. This provides a novel and broader perspective for evaluating LLM unlearning robustness.

**Significance:** This paper also provides three relearning datasets with different levels of relevance for the widely discussed topic of relearning attacks in the unlearning field.

**Quality:** The work's credibility is enhanced by building the new benchmark on top of existing, hi

Weaknesses

**Limited Methodological Coverage:** As a benchmark paper, the work has notable methodological limitations; it evaluates only 3 to 4 unlearning methods and fails to include several recent robust unlearning algorithms. This makes the benchmark less convincing as a comprehensive and representative evaluation.

**Lack of Deeper Insights:** The paper lacks some novelty. Although it positions itself as a benchmark study, it should still provide deeper insights and high-level analyses for each newl

Reviewer 03·Rating 4·Confidence 4

Strengths

1. This paper addresses a practical problem of machine unlearning in LLMs.
2. Using four existing benchmarks, it evaluates MU methods comprehensively.
3. The writing is generally good and easy to understand.

Weaknesses

1. The proposed benchmark appears to build heavily on prior work, as the authors acknowledge, which limits the paper's overall contribution. Moreover, the issue the authors describe, where retained data is related to the data to be forgotten, seems inherent to the definition of MU (i.e. what is the purpose of unlearning and which kinds of knowledge should be removed). The finding that relearning different amounts of forgotten data can reinstate the supposedly unlearned knowledge is unsurprising,

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)