Probing Knowledge Holes in Unlearned LLMs

Myeongseob Ko, Hoang Anh Just, Charles Fleming, Ming Jin, and Ruoxi Jia

arXiv:2511.00030 · cs.LG · November 4, 2025


TL;DR

This paper shows that machine unlearning can unintentionally create 'knowledge holes': significant hidden losses of benign knowledge that standard benchmarks do not capture, which undermines model reliability.

Contribution

It introduces a novel test case generation framework to detect hidden knowledge holes in unlearned large language models, highlighting limitations of current evaluation methods.

Findings

01

Up to 98.7% of generated test cases yield irrelevant or nonsensical responses from unlearned models, even though the pretrained model can answer them.

02

Unlearning can cause significant hidden knowledge loss not detected by standard benchmarks.

03

Proposes a new evaluation approach for knowledge preservation in unlearning.

Abstract

Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training, without requiring full retraining. While recent unlearning techniques can effectively remove undesirable content without severely compromising performance on standard benchmarks, we find that they may inadvertently create "knowledge holes": unintended losses of benign knowledge that standard benchmarks fail to capture. To probe where unlearned models reveal knowledge holes, we propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures. Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings…
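The headline statistic is a rate over test cases that the pretrained model can answer but the unlearned model no longer can. The sketch below shows one way such a rate could be computed. It assumes a hypothetical `TestCase` record and a toy `keyword_judge` that stands in for whatever relevance judgment the paper actually uses, which the abstract does not describe. It is an illustration of the arithmetic, not the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TestCase:
    # Hypothetical record: one probe question plus both models' responses.
    prompt: str
    expected: str  # reference answer used only by the toy judge below
    pretrained_response: str
    unlearned_response: str


# A judge decides whether a response is a relevant answer to a test case.
Judge = Callable[[TestCase, str], bool]


def keyword_judge(case: TestCase, response: str) -> bool:
    # Toy stand-in (assumption) for a human or model-based grader:
    # a response counts as relevant if it mentions the expected answer.
    return case.expected.lower() in response.lower()


def knowledge_hole_rate(cases: Sequence[TestCase], judge: Judge = keyword_judge) -> float:
    # Keep only test cases the pretrained model answers relevantly.
    answerable = [c for c in cases if judge(c, c.pretrained_response)]
    if not answerable:
        return 0.0
    # Count answerable cases where the unlearned model's response is no longer relevant.
    holes = sum(1 for c in answerable if not judge(c, c.unlearned_response))
    return holes / len(answerable)


if __name__ == "__main__":
    cases = [
        TestCase("Capital of France?", "Paris", "The capital is Paris.", "I like turtles."),
        TestCase("2 + 2?", "4", "2 + 2 = 4", "The answer is 4."),
        TestCase("Author of Hamlet?", "Shakespeare", "William Shakespeare", "Hamlet is a play."),
        TestCase("Boiling point of water in C?", "100", "It boils at 99 degrees.", "100"),
    ]
    # The last case is excluded (pretrained answer judged irrelevant);
    # 2 of the remaining 3 cases are holes.
    print(knowledge_hole_rate(cases))  # prints 0.6666666666666666
```

Filtering to pretrained-answerable cases first matters: without it, questions the base model never knew would be miscounted as knowledge lost to unlearning.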


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis