Locking Down the Finetuned LLMs Safety
Minjun Zhu, Linyi Yang, Yifan Wei, Ningyu Zhang, Yue Zhang

TL;DR
SafetyLock is a novel method that preserves safety in fine-tuned large language models by using activation bias directions, significantly reducing harmful responses without extra computational cost.
Contribution
We introduce SafetyLock, a transferable safety alignment technique that maintains model safety post-fine-tuning by leveraging safety-related activation representations.
Findings
Reduces harmful instruction response rate from 60% to below 1%.
Achieves real-time safety re-alignment in under 0.01 seconds.
Outperforms traditional safety methods in efficiency and effectiveness.
Abstract
Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks. However, existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning. Alarmingly, fine-tuning with just 10 toxic sentences can make models comply with harmful instructions. We introduce SafetyLock, a novel alignment intervention method that maintains robust safety post-fine-tuning through efficient and transferable mechanisms. SafetyLock leverages our discovery that fine-tuned models retain similar safety-related activation representations to their base models. This insight enables us to extract what we term the Meta-SafetyLock, a set of safety bias directions representing key activation patterns associated with safe responses in the original model. We can then apply these…
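The idea in the abstract, extracting safety bias directions from the original model's activations and reusing them on a fine-tuned model, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names, the single hypothetical attention head, the made-up activation values, and the choice of a normalized mean-difference vector as the direction (the ITI-style recipe the reviewers refer to). The abstract does not specify the authors' actual procedure, so this should not be read as their implementation.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def safety_direction(safe_acts: List[Vector], unsafe_acts: List[Vector]) -> Vector:
    """Unit vector pointing from the mean unsafe activation to the mean safe one.

    Assumption: a mean-difference direction; the paper may compute it differently.
    """
    diff = [s - u for s, u in zip(mean_vector(safe_acts), mean_vector(unsafe_acts))]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("safe and unsafe activations have identical means")
    return [d / norm for d in diff]


def apply_direction(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift one activation vector by alpha along the stored direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations for one hypothetical attention head of the original model.
safe = [[1.0, 0.0], [3.0, 0.0]]
unsafe = [[-1.0, 0.0], [-3.0, 0.0]]
direction = safety_direction(safe, unsafe)

# Reuse the stored direction on an activation from a hypothetical fine-tuned model.
# alpha is an illustrative intervention strength, not a value from the paper.
shifted = apply_direction([0.5, 2.0], direction, alpha=2.0)
```

In this sketch the direction is computed once and then applied with a single vector addition per activation, so applying it requires no gradient updates or retraining.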
Peer Reviews
Decision · Submitted to ICLR 2025
**Strengths** - The paper proposes a new idea to prevent safety compromise owing to downstream fine-tuning of the model. While there are existing safety vector-based approaches to prevent safety compromise (with comparable memory and time footprint), the activation-based safety vector computation is a novel contribution. - The experiments are sound, demonstrating the method's effectiveness in mitigating safety issues across various risk levels, including explicitly harmful, implicitly harmful,
**Weaknesses** - The paper largely ignores comparisons with a line of closely related work, beginning with "Language models are Homer Simpson! Safety re-alignment of fine-tuned language models through task arithmetic," (R1) which also introduced the concept of a *safety vector*. Therefore, I do not agree with the claim, *"we are the first to consider locating safety vectors and then restoring the safety of fine-tuned LLMs using an inference-time intervention method,"* as there are a series of s
1. They apply the ITI method [1] to LLM safety and verify the effectiveness of safety heads. 2. Experimental results show that safety heads exist and that the method can efficiently enhance safety across different models. [1] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model, NeurIPS 2023
1. The method is not novel. It totally follows the method in ITI [1] without considering the properties of safety itself, which influence the accuracy of the analysis. (1) As noted in the abstract (line 14), only 10 sentences can compromise the models' safety mechanisms. However, as illustrated in line 193, the safety components in LLMs are composed of three-fourths for both LLama3-8b and LLama3-70b. If the safety mechanism is easily breached, it may be due to a relatively small number of param
1. The main contribution of this paper is to point out an important safety issue in distributing LLMs: how can we guarantee the safety of an LLM after release, when users can use fine-tuning to bypass its safety alignment? Although the reviewer finds the proposed solution less than satisfying (see Weakness 1), the reviewer acknowledges the value of the research question itself. 2. This paper provides insights into the inner mechanisms of LLMs with respect to safety.
1. Regarding the research question. The paper aims to propose a *lock* to ensure that the safety behavior of a released LLM cannot be jailbroken. To this end, the major drawback of the proposed method is that the safety lock is applied to the LLM after fine-tuning, which means that the fine-tuned LLM must still be held and served by its original provider. This is effective for proprietary LLMs such as GPT-4, where the model checkpoint is still kept by the company even after fine-tun
Taxonomy
Topics: Nuclear and radioactivity studies
Methods: Balanced Selection · Sparse Evolutionary Training
