Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Boyi Wei; Kaixuan Huang; Yangsibo Huang; Tinghao Xie; Xiangyu Qi; Mengzhou Xia; Prateek Mittal; Mengdi Wang; Peter Henderson

arXiv:2402.05162 · cs.LG · October 28, 2024 · 2 cites

TL;DR

This paper investigates the fragility of safety alignment in large language models by using pruning and low-rank modifications to identify safety-critical regions, showing that these regions are sparse and that the models remain vulnerable to attacks.

Contribution

The study introduces methods to pinpoint safety-critical regions in LLMs at the neuron and rank levels, and demonstrates that these regions are sparse and that removing them compromises safety without significantly impacting utility.

Findings

01

Safety-critical regions are sparse: about 3% at the parameter level and 2.5% at the rank level.

02

Removing these regions compromises safety with minimal utility loss.

03

LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted.
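
As a toy illustration of what restricting modifications to a region can mean in the third finding, the sketch below applies a gradient step that leaves masked parameters untouched. This is only an assumption about the general idea, not the paper's experimental setup: the function name `masked_sgd_step`, the plain SGD update, and the example values are all hypothetical.

```python
import numpy as np

def masked_sgd_step(weights, grads, frozen_mask, lr=0.1):
    """One SGD step that skips every parameter flagged in `frozen_mask`.

    Hypothetical helper: `frozen_mask` is a boolean array marking the
    parameters (e.g. a presumed safety-critical region) that must not change.
    """
    # Zero the update wherever the mask is True, keep lr * grad elsewhere.
    update = np.where(frozen_mask, 0.0, lr * grads)
    return weights - update

# Illustrative values only.
weights = np.ones(4)
grads = np.array([1.0, 2.0, 3.0, 4.0])
frozen_mask = np.array([True, False, False, True])
updated = masked_sgd_step(weights, grads, frozen_mask, lr=0.5)
```

The finding is that this kind of restriction does not, on its own, prevent fine-tuning attacks: the unfrozen parameters can still change.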

Abstract

Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These…
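
One way to read "disentangled from utility-relevant regions" at the neuron level is as a set difference: keep the parameters that score highly for safety but not for utility, then zero them out. The sketch below is a minimal illustration under that assumption, not the paper's implementation. The importance scores are taken as given arrays, and the names `top_fraction_mask`, `isolate_safety_region`, and `prune`, along with the fractions `p` and `q`, are hypothetical.

```python
import numpy as np

def top_fraction_mask(scores, fraction):
    """Boolean mask selecting the `fraction` highest-scoring entries."""
    flat = scores.ravel()
    k = int(round(fraction * flat.size))
    mask = np.zeros(flat.size, dtype=bool)
    if k > 0:
        # Indices of the k largest scores (order within the top k is arbitrary).
        top_idx = np.argpartition(flat, -k)[-k:]
        mask[top_idx] = True
    return mask.reshape(scores.shape)

def isolate_safety_region(safety_scores, utility_scores, p, q):
    """Entries in the top-q fraction for safety but not the top-p for utility."""
    safety_top = top_fraction_mask(safety_scores, q)
    utility_top = top_fraction_mask(utility_scores, p)
    return safety_top & ~utility_top

def prune(weights, mask):
    """Return a copy of `weights` with the masked entries set to zero."""
    pruned = weights.copy()
    pruned[mask] = 0.0
    return pruned

# Illustrative scores only: safety and utility importance partly overlap.
safety_scores = np.arange(100, dtype=float).reshape(10, 10)
utility_scores = np.arange(100, dtype=float).reshape(10, 10)
region = isolate_safety_region(safety_scores, utility_scores, p=0.05, q=0.10)
pruned = prune(np.ones((10, 10)), region)
```

In this toy case the top 10% by safety score and the top 5% by utility score overlap, so the isolated region is the part of the safety set that utility does not claim.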

Taxonomy

Topics: Risk and Safety Analysis · Fatigue and fracture mechanics

Methods: Pruning