Safety Layers in Aligned Large Language Models: The Key to LLM Security

Shen Li; Liuyi Yao; Lan Zhang; Yaliang Li

arXiv:2408.17003·cs.CR·April 8, 2025·2 cites

TL;DR

This paper identifies specific internal layers in aligned large language models that are crucial for security, and proposes a fine-tuning method that preserves security without sacrificing performance.

Contribution

It uncovers the existence and location of safety layers within LLMs and introduces SPPFT, a fine-tuning technique that maintains security during model updates.

Findings

01

Safety layers are located in the middle of the model.

02

SPPFT preserves security and performance during fine-tuning.

03

The method reduces computational resources needed for fine-tuning.
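As a rough illustration of findings 02 and 03: SPPFT, as we understand it from the paper, keeps the safety-layer parameters fixed during fine-tuning, so those layers receive no gradient updates. The sketch below only decides which parameters stay trainable. The parameter naming scheme (`model.layers.<i>.`), the helper names, and the layer range 4 to 8 are assumptions made for illustration, not values taken from the paper or its repository.

```python
import re

# Assumed naming scheme: decoder blocks appear as "...layers.<index>...."
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def layer_index(param_name):
    """Return the decoder-layer index encoded in a parameter name, or None."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def trainable_mask(param_names, safety_start, safety_end):
    """Map each parameter name to True (trainable) or False (frozen).

    Parameters inside the inclusive range [safety_start, safety_end] are
    frozen; everything else, including embeddings and the output head,
    stays trainable.
    """
    mask = {}
    for name in param_names:
        idx = layer_index(name)
        frozen = idx is not None and safety_start <= idx <= safety_end
        mask[name] = not frozen
    return mask


# Hypothetical parameter names and a hypothetical safety-layer range.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.5.self_attn.q_proj.weight",
    "model.layers.9.mlp.down_proj.weight",
    "lm_head.weight",
]
mask = trainable_mask(names, safety_start=4, safety_end=8)

# With a PyTorch model, the mask could be applied along these lines:
#   for name, param in model.named_parameters():
#       param.requires_grad = mask[name]
```

Frozen parameters need neither gradients nor optimizer state, which is one plausible reading of why the method lowers the cost of fine-tuning.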

Abstract

Aligned LLMs are secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining such security is not yet well understood. Furthermore, these models can be vulnerable to security degradation when subjected to fine-tuning attacks. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as "safety layers". We first confirm the existence of these safety layers by analyzing variations in input vectors within the model's internal layers. Additionally, we leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on these findings, we propose a novel…
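The layer-wise vector analysis mentioned in the abstract can be pictured with a small sketch. It assumes the comparison is a per-layer cosine similarity between the vectors of two queries, and that a widening gap between a normal/normal pair and a normal/malicious pair marks where the model starts to separate them. The function names and the toy 2-dimensional vectors are hypothetical; real vectors would come from the model's hidden states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layerwise_cosine(vecs_a, vecs_b):
    """vecs_a[i] and vecs_b[i] are the layer-i vectors of two queries."""
    return [cosine(u, v) for u, v in zip(vecs_a, vecs_b)]


def similarity_gap(nn_curve, nm_curve):
    """Per-layer gap: normal/normal similarity minus normal/malicious."""
    return [nn - nm for nn, nm in zip(nn_curve, nm_curve)]


# Toy per-layer vectors for three queries (three layers, made-up numbers).
normal_1 = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]
normal_2 = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.3]]
malicious = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

nn_curve = layerwise_cosine(normal_1, normal_2)
nm_curve = layerwise_cosine(normal_1, malicious)
gap = similarity_gap(nn_curve, nm_curve)

# In this toy data the gap is zero at the first layer and grows afterwards.
for layer, value in enumerate(gap):
    print(f"layer {layer}: gap = {value:.3f}")
```

In practice one would average such curves over many query pairs before reading anything into the shape, which also bears on Reviewer 02's concern about distributional variation.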

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The identification of a specific set of middle layers in aligned LLMs as key to recognizing malicious inputs is an intriguing finding that provides a new perspective on model robustness.
- The paper presents clear and effective visualizations to support the findings.
- Experiments demonstrate the method’s efficacy.

Weaknesses

- The type of attacks addressed by the proposed method is unclear. While related work discusses jailbreak, the experiments primarily use backdoor datasets. The paper lacks a defined threat model and discussion of the defender's capabilities and objectives.
- Since the method requires access to model parameters, it is not suitable for popular black-box LLMs, which limits its applicability.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The authors provide an insightful perspective on analyzing LLM safety.
- The paper is clearly presented and well-structured.
- The proposed SPPFT approach is shown to be effective.

Weaknesses

- The conclusion lacks rigor; differences in vectors may be explained by distributional variations.
- The assumption that certain layers, rather than individual neurons within each layer, are related to safety requires more clarification.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Extensive analysis is made to justify the existence of safety layers.
2. The defense solution is very simple.
3. The paper is well written and easy to read.

Weaknesses

* The defense can only be applied to backdoor attacks, making its application very narrow. It is unknown whether the method can be extended to general harmful fine-tuning attack scenarios (in which a percentage of harmful data, with no trigger in the question, is mixed into the fine-tuning process). I guess the answer is yes, but the authors should demonstrate this with experiments. Moreover, I personally don't think the backdoor attack for safety unalignment is very reasonable (see my question).

Code & Models

Repositories

listen0425/safety-layers
PyTorch · Official

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Cloud Data Security Solutions

Methods: Sparse Evolutionary Training