Probing Knowledge Holes in Unlearned LLMs
Myeongseob Ko, Hoang Anh Just, Charles Fleming, Ming Jin, and Ruoxi Jia

TL;DR
This paper reveals that machine unlearning can unintentionally cause 'knowledge holes', leading to significant hidden losses of benign knowledge not captured by standard benchmarks, which impacts model reliability.
Contribution
It introduces a novel test case generation framework to detect hidden knowledge holes in unlearned large language models, highlighting limitations of current evaluation methods.
Findings
Up to 98.7% of the generated test cases yield irrelevant or nonsensical responses from unlearned models, even though the pretrained model can answer them.
Unlearning can cause significant hidden knowledge loss not detected by standard benchmarks.
A test case generation framework offers a new way to evaluate knowledge preservation after unlearning.
Abstract
Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training, without requiring full retraining. While recent unlearning techniques can effectively remove undesirable content without severely compromising performance on standard benchmarks, we find that they may inadvertently create "knowledge holes": unintended losses of benign knowledge that standard benchmarks fail to capture. To probe where unlearned models reveal knowledge holes, we propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures. Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings…
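The headline figure in the abstract is a rate over test cases that the pretrained model can answer but the unlearned model cannot. The summary does not give the paper's exact metric or judging procedure, so the following is only a minimal sketch of how such a rate might be computed. The names `TestCase`, `pretrained_ok`, `unlearned_ok`, and `knowledge_hole_rate` are hypothetical, and the relevance judgments are assumed to come from some external judge that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class TestCase:
    """One probe prompt with relevance judgments (hypothetical structure)."""
    prompt: str
    pretrained_ok: bool  # assumed: a judge rated the pretrained model's answer as relevant
    unlearned_ok: bool   # assumed: a judge rated the unlearned model's answer as relevant


def knowledge_hole_rate(cases):
    """Fraction of pretrained-answerable cases that the unlearned model fails.

    Cases the pretrained model could not answer are excluded, since a failure
    there cannot be attributed to unlearning.
    """
    answerable = [c for c in cases if c.pretrained_ok]
    if not answerable:
        return 0.0
    holes = [c for c in answerable if not c.unlearned_ok]
    return len(holes) / len(answerable)


# Illustrative data only; these are not results from the paper.
cases = [
    TestCase("benign question A", pretrained_ok=True, unlearned_ok=False),
    TestCase("benign question B", pretrained_ok=True, unlearned_ok=True),
    TestCase("benign question C", pretrained_ok=True, unlearned_ok=False),
    TestCase("benign question D", pretrained_ok=False, unlearned_ok=False),
]
rate = knowledge_hole_rate(cases)
print(f"knowledge hole rate: {rate:.3f}")
```

Restricting the denominator to cases the pretrained model answers mirrors the abstract's qualifier "despite being answerable by the pretrained model"; whether the paper normalizes in exactly this way is an assumption.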
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
