Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Wei Zhao; Zhe Li; Yige Li; Ye Zhang; Jun Sun

arXiv:2405.18166 · cs.AI · June 17, 2024

TL;DR

This paper introduces Layer-specific Editing (LED), a defense method that makes large language models more resilient to jailbreak attacks by realigning the safety layers inside the model, reducing harmful responses to adversarial prompts.

Contribution

The paper proposes a new defense technique, LED, that targets the inner safety layers of LLMs and improves robustness against jailbreak prompts without sacrificing performance on benign prompts.

Findings

01. LED significantly reduces jailbreak success rates across multiple LLMs.

02. Realigning safety layers improves model safety while maintaining benign prompt performance.

03. Extensive experiments validate LED's effectiveness and generalizability.

Abstract

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of…

Code & Models

Repositories

ledllm/ledllm
PyTorch · Official

Videos

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing · Underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Privacy-Preserving Technologies in Data

Methods: Focus