Safety Alignment via Constrained Knowledge Unlearning
Zesheng Shi, Yucheng Zhou, Jing Li

TL;DR
This paper introduces Constrained Knowledge Unlearning (CKU), a method that improves the safety of large language models by selectively unlearning harmful knowledge while preserving useful knowledge, so that safety gains do not come at the cost of overall performance.
Contribution
The paper proposes CKU, a new safety alignment technique that localizes and unlearns harmful knowledge in LLMs through neuron scoring and gradient pruning, improving safety and interpretability.
Findings
CKU significantly reduces harmful outputs in LLMs.
CKU maintains overall model performance while enhancing safety.
Neuron analysis reveals insights into safety and knowledge retention.
Abstract
Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall…
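The two steps the abstract describes (score neurons to find a protected subset U, then prune the gradients of U during unlearning) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not say how scores are computed, how U is sized, or which unlearning objective is used. The function names, the `keep_fraction` parameter, the top-k selection rule, and the use of a gradient-ascent step on a harmful-data loss are all assumptions made for illustration. Each row of `weights` and `grads` stands for one MLP neuron.

```python
from typing import List, Set

Matrix = List[List[float]]


def select_protected_neurons(scores: List[float], keep_fraction: float) -> Set[int]:
    """Return the indices of the highest-scoring neurons (the protected set U).

    Assumption: a higher score means the neuron matters more for useful knowledge.
    """
    k = int(len(scores) * keep_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def prune_gradients(grads: Matrix, protected: Set[int]) -> Matrix:
    """Zero the gradient rows of neurons in U; copy all other rows unchanged."""
    return [
        [0.0] * len(row) if i in protected else list(row)
        for i, row in enumerate(grads)
    ]


def unlearning_step(weights: Matrix, grads: Matrix, protected: Set[int], lr: float) -> Matrix:
    """Apply one gradient-ascent step on a (hypothetical) harmful-data loss.

    Neurons in U receive a zero gradient, so their weights are left untouched.
    """
    pruned = prune_gradients(grads, protected)
    return [
        [w + lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, pruned)
    ]


# Toy example: four neurons, two weights each.
scores = [0.9, 0.1, 0.5, 0.3]
U = select_protected_neurons(scores, keep_fraction=0.5)
weights = [[1.0, 1.0] for _ in range(4)]
grads = [[0.5, 0.5] for _ in range(4)]
updated = unlearning_step(weights, grads, U, lr=0.5)
```

In this toy run, the two highest-scoring neurons (indices 0 and 2) keep their weights, while the remaining two are updated. In a real model the same masking would be applied to the gradient tensors of the chosen MLP layers before the optimizer step.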
Taxonomy
Topics: Risk and Safety Analysis · Fault Detection and Control Systems · Software Reliability and Analysis Research
