Safety Alignment via Constrained Knowledge Unlearning

Zesheng Shi; Yucheng Zhou; Jing Li

arXiv:2505.18588 · cs.CL · May 27, 2025

TL;DR

This paper introduces Constrained Knowledge Unlearning (CKU), a novel method for improving safety in large language models by selectively unlearning harmful knowledge while preserving useful information, thereby enhancing safety without sacrificing performance.

Contribution

The paper proposes CKU, a new safety alignment technique that localizes and unlearns harmful knowledge in LLMs through neuron scoring and gradient pruning, improving safety and interpretability.

Findings

01 CKU significantly reduces harmful outputs in LLMs.

02 CKU maintains overall model performance while enhancing safety.

03 Neuron analysis reveals insights into safety and knowledge retention.

Abstract

Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall…
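
The gradient-pruning idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-neuron scores, the `keep_ratio` parameter, the top-k selection rule, and the use of gradient ascent as the unlearning update are all assumptions made here for illustration, and every function name is hypothetical. The sketch treats one MLP layer as a list of weight rows (one row per neuron), picks the highest-scoring neurons as the protected set U, and zeroes their gradients before the unlearning update.

```python
from typing import List, Set

Matrix = List[List[float]]


def select_protected_neurons(scores: List[float], keep_ratio: float) -> Set[int]:
    """Return the indices of the top `keep_ratio` fraction of neurons by score.

    Assumption: a higher score means the neuron matters more for useful
    knowledge. How the scores are computed is outside this sketch.
    """
    k = int(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def prune_gradients(grads: Matrix, protected: Set[int]) -> Matrix:
    """Zero the gradient row of each protected neuron; copy the others."""
    return [
        [0.0] * len(row) if i in protected else list(row)
        for i, row in enumerate(grads)
    ]


def unlearning_step(weights: Matrix, grads: Matrix,
                    protected: Set[int], lr: float) -> Matrix:
    """Apply one update that skips protected neurons.

    Assumption: unlearning is modeled as gradient ascent on a loss computed
    over harmful data, so the masked gradient is added, not subtracted.
    """
    masked = prune_gradients(grads, protected)
    return [
        [w + lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, masked)
    ]


# Toy layer with four neurons and two weights per neuron.
scores = [0.9, 0.1, 0.5, 0.3]              # hypothetical usefulness scores
weights = [[1.0, 1.0] for _ in scores]
grads = [[0.5, -0.5] for _ in scores]      # hypothetical harmful-loss gradients

protected = select_protected_neurons(scores, keep_ratio=0.5)
updated = unlearning_step(weights, grads, protected, lr=0.5)
```

In this toy run, neurons 0 and 2 have the highest scores, so they form U and keep their weights, while neurons 1 and 3 are updated. In a real training loop the same masking would be applied to the layer's gradient tensor before the optimizer step.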

Taxonomy

Topics: Risk and Safety Analysis · Fault Detection and Control Systems · Software Reliability and Analysis Research