NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Xin Yi; Shunfan Zheng; Linlin Wang; Gerard de Melo; Xiaoling Wang; Liang He

arXiv:2412.12497 · cs.CL · December 18, 2024

TL;DR

NLSR is a training-free method that restores safety in large language models after fine-tuning by transplanting neurons identified through a reference model, effectively mitigating harmful modifications without retraining.

Contribution

Proposes NLSR, a novel neuron-level safety realignment framework that does not require additional training, using neuron similarity differences to identify and restore safety-critical neurons.

Findings

01. Significant safety improvements in fine-tuned models across multiple tasks.

02. Maintains high task accuracy while enhancing safety.

03. Effective neuron transplantation without additional training.

Abstract

The emergence of finetuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference…
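The abstract is cut off before it describes the procedure in full, so the following is only a rough sketch of the general idea it names: compare neurons before and after fine-tuning and transplant the ones that drifted. It assumes that a neuron is one row of a weight matrix, that similarity is cosine similarity, and that a fixed threshold decides which neurons to restore. The names `realign`, `safety_rows`, and `threshold` are hypothetical; how NLSR actually builds its safety reference model and identifies safety-critical neurons is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def realign(reference, finetuned, safety_rows, threshold=0.9):
    """Copy reference rows over fine-tuned rows that drifted too far.

    Assumptions (not taken from the paper): `reference` and `finetuned` are
    weight matrices given as lists of rows, `safety_rows` lists the row
    indices already identified as safety-critical, and a row counts as
    drifted when its cosine similarity to the reference row is below
    `threshold`.
    """
    patched = [row[:] for row in finetuned]  # leave the input untouched
    transplanted = []
    for i in safety_rows:
        if cosine(reference[i], finetuned[i]) < threshold:
            patched[i] = reference[i][:]
            transplanted.append(i)
    return patched, transplanted


# Toy 3-neuron layer; rows 0 and 1 are assumed to be safety-critical.
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
finetuned = [[1.0, 0.1], [1.0, 0.0], [-1.0, 1.0]]
patched, moved = realign(reference, finetuned, safety_rows=[0, 1])
print(moved)
print(patched)
```

In this toy case only row 1 is restored: row 0 barely moved, and row 2 is never inspected because it is not in `safety_rows`, so its fine-tuned weights are kept. No gradient updates are involved, which is the sense in which such a procedure is training-free.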

Code & Models

Repositories

xinykou/nlsr (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning