Safety Layers in Aligned Large Language Models: The Key to LLM Security
Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li

TL;DR
This paper identifies specific internal layers in aligned large language models that are crucial for security, and proposes a fine-tuning method that preserves security without sacrificing performance.
Contribution
It uncovers the existence and location of safety layers within LLMs and introduces SPPFT, a fine-tuning technique that maintains security during model updates.
Findings
Safety layers are located in the middle of the model.
SPPFT preserves security and performance during fine-tuning.
The method reduces computational resources needed for fine-tuning.
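The second and third findings fit together if the safety layers are held fixed during fine-tuning: frozen layers cannot drift away from their aligned state, and fewer parameters need updates. The plain-Python sketch below illustrates that reading on a toy model. It assumes SPPFT works by freezing a contiguous block of layers, which this summary does not spell out. The names `Layer`, `freeze_range`, `sgd_step`, and `trainable_params`, the layer indices, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Toy stand-in for one transformer block: just a flat list of weights."""
    weights: List[float]
    trainable: bool = True


def freeze_range(layers: List[Layer], start: int, end: int) -> None:
    """Mark layers[start..end] (inclusive) as frozen. Indices are hypothetical."""
    for i in range(start, end + 1):
        layers[i].trainable = False


def sgd_step(layers: List[Layer], grads: List[List[float]], lr: float) -> None:
    """Plain SGD update that skips every frozen layer."""
    for layer, layer_grads in zip(layers, grads):
        if not layer.trainable:
            continue  # frozen layers keep their original weights
        layer.weights = [w - lr * g for w, g in zip(layer.weights, layer_grads)]


def trainable_params(layers: List[Layer]) -> int:
    """Count the weights that still receive updates."""
    return sum(len(layer.weights) for layer in layers if layer.trainable)


# Eight toy layers of four weights each; layers 3..5 play the "safety layers".
model = [Layer([1.0] * 4) for _ in range(8)]
freeze_range(model, 3, 5)

# Pretend every weight received a gradient of 0.5 from one fine-tuning batch.
grads = [[0.5] * 4 for _ in model]
sgd_step(model, grads, lr=0.1)

print(model[0].weights)         # a layer that was updated
print(model[4].weights)         # a frozen layer, left unchanged
print(trainable_params(model))  # number of weights still being trained
```

In a real framework the same effect would normally come from disabling gradients on the chosen layers rather than filtering inside the optimizer step; the toy version only makes the bookkeeping visible.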
Abstract
Aligned LLMs are secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining this security is not yet well understood, and these models can suffer security degradation when subjected to fine-tuning attacks. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as "safety layers". We first confirm the existence of these safety layers by analyzing variations in input vectors within the model's internal layers. Additionally, we leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on these findings, we propose a novel…
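The existence check described in the abstract compares how internal vectors vary across queries at each layer. As a minimal sketch, assuming the comparison is a per-layer cosine similarity between the hidden vectors of two queries (the truncated abstract does not name the metric), the code below finds the first layer at which a normal-vs-malicious pair separates from a normal-vs-normal pair. The vectors are made-up toy values, and `cosine`, `layer_cosines`, and `first_divergent_layer` are hypothetical helpers. A real analysis would average over many query pairs and use vectors extracted from an actual model.

```python
import math
from typing import List, Optional

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layer_cosines(query_1: List[Vector], query_2: List[Vector]) -> List[float]:
    """Cosine similarity between two queries' hidden vectors at each layer."""
    return [cosine(a, b) for a, b in zip(query_1, query_2)]


def first_divergent_layer(
    normal_normal: List[float], normal_malicious: List[float], gap: float
) -> Optional[int]:
    """Index of the first layer where the two similarity curves differ by more than `gap`."""
    for i, (same, mixed) in enumerate(zip(normal_normal, normal_malicious)):
        if same - mixed > gap:
            return i
    return None


# Toy per-layer hidden vectors (4 layers, 2 dimensions) for three queries.
normal_a = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [1.0, 0.3]]
normal_b = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.3], [1.0, 0.4]]
malicious = [[1.0, 0.0], [1.0, 0.1], [0.2, 1.0], [0.0, 1.0]]

nn = layer_cosines(normal_a, normal_b)
nm = layer_cosines(normal_a, malicious)

# The threshold of 0.2 is an arbitrary illustrative choice.
print(first_divergent_layer(nn, nm, gap=0.2))
```

With these toy values the two curves agree for the first two layers and split at the third, which is the kind of pattern that would point to where a block of safety-relevant layers begins.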
Peer Reviews
Decision·ICLR 2025 Poster
- The identification of a specific set of middle layers in aligned LLMs as key to recognizing malicious inputs is an intriguing finding that provides a new perspective on model robustness.
- The paper presents clear and effective visualizations to support the findings.
- Experiments demonstrate the method's efficacy.

- The type of attacks addressed by the proposed method is unclear. While related work discusses jailbreak, the experiments primarily use backdoor datasets. The paper lacks a defined threat model and discussion of the defender's capabilities and objectives.
- Since the method requires access to model parameters, it is not suitable for popular black-box LLMs, which limits its applicability.

- The authors provide an insightful perspective on analyzing LLM safety.
- The paper is clearly presented and well-structured.
- The proposed SPPFT approach is shown to be effective.

- The conclusion lacks rigor; differences in vectors may be explained by distributional variations.
- The assumption that certain layers, rather than individual neurons within each layer, are related to safety requires more clarification.

1. Extensive analysis is made to justify the existence of safety layers.
2. The defense solution is very simple.
3. The paper is well written and easy to read.

* The defense can only be applied to backdoor attacks, making its application very narrow. It is unknown whether the method can be extended to general harmful fine-tuning attack scenarios (in which a percentage of harmful data, with no trigger in the question, is mixed into the fine-tuning process). I guess the answer is yes, but the authors should demonstrate this with experiments. Moreover, I personally don't think the backdoor attack for safety unalignment is very reasonable (see my question). R
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Cloud Data Security Solutions
Methods: Sparse Evolutionary Training
