Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs
Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn

TL;DR
This paper introduces SPLoRA, a pruning method that enhances safety alignment in fine-tuned LLMs by removing safety-weakening layers, using a new similarity metric to detect misalignment, and demonstrating improved safety and performance.
Contribution
We propose SPLoRA, a novel pruning approach built on the E-DIEM metric, to improve safety alignment in LoRA-fine-tuned LLMs while maintaining utility and reducing inference costs.
Findings
SPLoRA outperforms existing safety alignment methods.
It significantly reduces safety risks in LLMs.
It maintains or improves model performance.
Abstract
Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this issue, we propose Safe Pruning LoRA (SPLoRA), a novel pruning-based approach that selectively removes LoRA layers that weaken safety alignment, improving safety while preserving performance. At its core, we introduce Empirical-DIEM (E-DIEM), a dimension-insensitive similarity metric that effectively detects safety misalignment in LoRA-adapted models. We conduct extensive experiments on LLMs fine-tuned with a mixture of benign and malicious data, as well as with purely benign datasets, evaluating SPLoRA across…
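The layer-selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the E-DIEM formula is not given here, so `scaled_distance` is a plain stand-in (Euclidean distance divided by the square root of the dimension), and the reference adapter, the threshold, the direction of the comparison, and all function names are hypothetical choices made only for illustration.

```python
import math

def lora_delta(B, A):
    """Dense LoRA update B @ A for one layer, using nested lists."""
    return [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)] for row in B]

def scaled_distance(x, y):
    """Stand-in score (NOT E-DIEM): Euclidean distance divided by sqrt(dimension)."""
    flat_x = [v for row in x for v in row]
    flat_y = [v for row in y for v in row]
    squared = sum((a - b) ** 2 for a, b in zip(flat_x, flat_y))
    return math.sqrt(squared / len(flat_x))

def select_layers_to_prune(finetuned, reference, threshold, score=scaled_distance):
    """Names of layers whose fine-tuned update scores above the threshold.

    Assumption: a large distance from a reference adapter marks a layer as
    safety-weakening. The paper's actual criterion may differ.
    """
    selected = []
    for name, (B, A) in finetuned.items():
        B_ref, A_ref = reference[name]
        if score(lora_delta(B, A), lora_delta(B_ref, A_ref)) > threshold:
            selected.append(name)
    return selected

def prune(finetuned, layer_names):
    """Drop the selected LoRA layers, so the base weights apply unchanged there."""
    return {k: v for k, v in finetuned.items() if k not in layer_names}

# Toy adapters: each layer maps to a (B, A) pair with rank 1.
reference = {
    "layer0": ([[1.0], [0.0]], [[1.0, 0.0]]),
    "layer1": ([[0.0], [1.0]], [[0.0, 1.0]]),
}
finetuned = {
    "layer0": ([[1.0], [0.0]], [[1.0, 0.1]]),   # close to the reference
    "layer1": ([[2.0], [-1.0]], [[1.0, 1.0]]),  # far from the reference
}
to_prune = select_layers_to_prune(finetuned, reference, threshold=0.5)
kept = prune(finetuned, to_prune)
```

Passing the score as a parameter keeps the selection loop independent of the metric, so a faithful E-DIEM implementation could replace the stand-in without changing the rest of the sketch.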
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · AI-based Problem Solving and Planning
Methods: Pruning
