TL;DR
This paper introduces Distinguishable Deletion, a novel approach for unlearning in large language models that restricts response distributions in latent space, effectively erases undesirable knowledge, and enables safe refusal mechanisms.
Contribution
The paper proposes a new unlearning paradigm, Distinguishable Deletion (D²), together with an energy index for efficient knowledge erasure and refusal in LLMs. The authors report that it outperforms previous unlearning methods.
Findings
Energy index accurately quantifies knowledge presence.
Energy-based unlearning enforces effective knowledge removal.
EUA outperforms previous unlearning methods.
Abstract
Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion (D²), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a…
