Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion
Feng Guo, Yuntao Wen, Shen Gao, Junshuo Zhang, Shuo Shang

TL;DR
This paper introduces KUnBR (Knowledge Density-Guided Unlearning via Blocks Reinsertion), an unlearning method for large language models. It uses knowledge density estimation to locate the layers that hold the most harmful knowledge, then removes that knowledge through a re-insertion strategy. The authors report that it outperforms existing unlearning methods.
Contribution
The paper proposes a new unlearning approach that precisely locates and eliminates harmful knowledge in LLMs using knowledge density estimation and re-insertion, ensuring effective forgetting and utility preservation.
Findings
KUnBR achieves state-of-the-art forgetting performance.
The method maintains model utility after unlearning.
Extensive experiments validate the effectiveness of the approach.
Abstract
Machine unlearning, which selectively removes harmful knowledge from a pre-trained model without retraining from scratch, is crucial for addressing privacy, regulatory compliance, and ethical concerns in Large Language Models (LLMs). However, existing unlearning methods often struggle to remove harmful knowledge thoroughly, leaving residual knowledge that can be easily recovered. To address these limitations, we propose Knowledge Density-Guided Unlearning via Blocks Reinsertion (KUnBR), a novel approach that first identifies layers rich in harmful knowledge and then thoroughly eliminates that knowledge via a re-insertion strategy. Our method introduces knowledge density estimation to quantify and locate the layers containing the most harmful knowledge, enabling precise unlearning. Additionally, we design a layer re-insertion strategy that extracts and re-inserts harmful…
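The layer-selection step described in the abstract can be illustrated with a small sketch. The abstract as shown here does not say how knowledge density is computed, so everything below is an assumption made for illustration: the per-layer scores are hypothetical stand-ins (for example, a gradient magnitude measured on the forget set), the density is taken to be each score's share of the total, and the names `knowledge_density`, `top_blocks`, `block_size`, and `k` are not from the paper. The re-insertion step is not sketched because the abstract is cut off before describing it.

```python
from typing import List, Sequence


def knowledge_density(layer_scores: Sequence[float]) -> List[float]:
    """Normalize non-negative per-layer scores so they sum to 1.

    Assumption: the score is some per-layer measure of how strongly the
    layer responds to the forget set. The paper's definition may differ.
    """
    total = sum(layer_scores)
    if total <= 0:
        raise ValueError("layer scores must have a positive sum")
    return [score / total for score in layer_scores]


def top_blocks(density: Sequence[float], block_size: int, k: int) -> List[int]:
    """Group consecutive layers into blocks and return the k densest block indices."""
    blocks = [density[i:i + block_size] for i in range(0, len(density), block_size)]
    block_density = [sum(block) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: block_density[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical scores for an 8-layer model (illustrative values, not results).
scores = [0.2, 0.1, 1.5, 1.2, 0.3, 0.4, 2.0, 0.3]
density = knowledge_density(scores)
selected = top_blocks(density, block_size=2, k=2)
print(selected)
```

With these made-up scores, the layers are grouped into four blocks of two, and the two blocks with the largest summed density are the ones that would be targeted for unlearning under this simplified reading.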
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
