Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Puning Yang; Junchi Yu; Qizhou Wang; Philip Torr; Bo Han; Xiuying Chen

arXiv:2605.16776·cs.LG·May 19, 2026


TL;DR

This paper introduces Distinguishable Deletion (D²), an unlearning approach for large language models that restricts the response distribution in latent space rather than suppressing specific tokens, erasing undesirable knowledge while still enabling safe refusal.

Contribution

The paper proposes the D² paradigm and an energy index for knowledge erasure and refusal in LLMs, and reports outperforming previous unlearning methods.

Findings

01 Energy index accurately quantifies knowledge presence.

02 Energy-based unlearning enforces effective knowledge removal.

03 EUA outperforms previous unlearning methods.
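
This page does not give the paper's definition of the energy index. As a rough illustration of the general idea of an energy score, the sketch below computes the standard free-energy quantity over a logit vector, E = -T * logsumexp(logits / T), as used in energy-based out-of-distribution detection. The function name `energy_score`, the `temperature` parameter, and the toy logits are assumptions made for this example, and the paper's own index may be defined differently.

```python
import math
from typing import Sequence


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Free energy of a logit vector: E = -T * logsumexp(logits / T).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


# Toy logits (assumed values): one peaked distribution, one flat distribution.
confident = [10.0, 0.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0, 0.0]

print(energy_score(confident))  # lower (more negative) energy
print(energy_score(flat))       # higher energy, equal to -log(4)
```

Under this score, a peaked logit vector yields lower energy than a flat one, which is the kind of separation an energy-style index could use to tell retained knowledge from erased knowledge.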

Abstract

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($D^{2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a…

Code & Models

Repositories

Puning97/EUA-for-LLM-Unlearning (GitHub)
