How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

Zhenhong Zhou; Haiyang Yu; Xinghua Zhang; Rongwu Xu; Fei Huang; Yongbin Li

arXiv:2406.05644·cs.CL·June 14, 2024

TL;DR

By analyzing intermediate hidden states, this paper explains how large language models learn safety concepts and how jailbreak techniques bypass safety measures.

Contribution

It introduces a method using weak classifiers to interpret LLM safety through hidden states and reveals how jailbreaks disrupt ethical concept transformations.
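To make the idea of probing hidden states with a weak classifier concrete, here is a minimal sketch under stated assumptions. It does not use the paper's data, models, or classifiers: the "hidden states" are synthetic Gaussian vectors, the per-layer class separations are arbitrary illustrative values, and a plain logistic-regression probe stands in for whichever weak classifiers the authors use. All function names are hypothetical.

```python
import numpy as np

def make_synthetic_hidden_states(separations=(0.0, 1.0, 3.0, 6.0),
                                 n_per_class=200, dim=16, seed=0):
    """Simulate per-layer hidden states for two input classes (0 = normal, 1 = malicious).

    ASSUMPTION: purely synthetic data. The separation values are arbitrary and
    are not results from the paper.
    """
    rng = np.random.default_rng(seed)
    labels = np.concatenate([np.zeros(n_per_class, dtype=int),
                             np.ones(n_per_class, dtype=int)])
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    states = []
    for separation in separations:
        x = rng.normal(size=(2 * n_per_class, dim))
        # Shift the two classes apart along one direction by `separation`.
        x += np.outer(labels - 0.5, direction) * separation
        states.append(x)
    return np.stack(states), labels  # shape: (n_layers, n_samples, dim)

def fit_logistic_probe(x, y, lr=0.1, steps=500):
    """Train a linear logistic-regression probe with batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy_per_layer(states, labels, train_frac=0.7, seed=0):
    """Fit one probe per layer and report held-out accuracy for each layer."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    n_train = int(train_frac * len(labels))
    train, test = order[:n_train], order[n_train:]
    accuracies = []
    for layer_states in states:
        w, b = fit_logistic_probe(layer_states[train], labels[train])
        pred = (layer_states[test] @ w + b > 0).astype(int)
        accuracies.append(float((pred == labels[test]).mean()))
    return accuracies

states, labels = make_synthetic_hidden_states()
accuracies = probe_accuracy_per_layer(states, labels)
for layer, acc in enumerate(accuracies):
    print(f"layer {layer}: probe accuracy {acc:.2f}")
```

With a real model, `states` would instead hold the hidden vector extracted at each layer for each prompt; high probe accuracy at a given layer would suggest that the malicious/normal distinction is linearly readable there.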

Findings

01. LLMs learn ethical concepts during pre-training.

02. Early layers identify malicious inputs.

03. Jailbreaks interfere with the ethical classification process.

Abstract

Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Unfortunately, jailbreaks can circumvent safety guardrails, causing LLMs to generate harmful content and raising concerns about LLM safety. Because language models with large numbers of parameters are often regarded as black boxes, the mechanisms of alignment and jailbreak are challenging to elucidate. In this paper, we employ weak classifiers to explain LLM safety through the intermediate hidden states. We first confirm that LLMs learn ethical concepts during pre-training rather than alignment and can distinguish malicious from normal inputs in the early layers. Alignment then associates these early concepts with emotion guesses in the middle layers and refines them into specific reject tokens for safe generations. Jailbreak disturbs the transformation of early unethical classification into…

Code & Models

Repositories

ydyjya/llm-ihs-explanation
PyTorch · Official

Taxonomy

Topics: Law, Economics, and Judicial Systems · Law, AI, and Intellectual Property