SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

Zhangchen Xu; Fengqing Jiang; Luyao Niu; Jinyuan Jia; Bill Yuchen Lin; Radha Poovendran

arXiv:2402.08983·cs.CR·July 29, 2024·3 cites

TL;DR

SafeDecoding is a novel safety-aware decoding method that enhances large language models' resistance to jailbreak attacks by amplifying safety disclaimers and reducing unsafe outputs, thereby improving safety without sacrificing helpfulness.

Contribution

This paper introduces SafeDecoding, a new decoding strategy that effectively defends against jailbreak attacks by leveraging safety disclaimers during token probability adjustments.

Findings

01 Significantly reduces jailbreak success rate
02 Outperforms six existing defense methods
03 Maintains helpfulness of responses to benign queries

Abstract

As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, which aim to provoke unintended and unsafe behaviors from LLMs, remain a significant LLM safety threat. In this paper, we aim to defend LLMs against jailbreak attacks by introducing SafeDecoding, a safety-aware decoding strategy for LLMs to generate helpful and harmless responses to user queries. Our insight in developing SafeDecoding is based on the observation that, even though the probabilities of tokens representing harmful content outweigh those representing harmless responses, safety disclaimers still appear among the top tokens after sorting tokens by probability in descending order. This allows us to mitigate jailbreak attacks by…
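
The observation in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: the next-token distribution, the set of disclaimer tokens, and the `boost` factor are all hypothetical values chosen for illustration. It only shows that a disclaimer token can sit among the top-ranked tokens while a harmful continuation ranks first, and that a simple reweighting of the distribution can change which token wins.

```python
def top_k(probs, k):
    """Return the k most probable (token, prob) pairs, highest first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


def reweight(probs, safe_tokens, boost):
    """Scale safe tokens up by `boost`, all others down by `boost`, then renormalize.

    This is an illustrative rule, not the adjustment used in the paper.
    """
    scaled = {
        tok: (p * boost if tok in safe_tokens else p / boost)
        for tok, p in probs.items()
    }
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}


# Hypothetical next-token distribution after a jailbreak prompt (made-up numbers).
probs = {"Sure": 0.55, "Here": 0.20, "I": 0.15, "Sorry": 0.07, "The": 0.03}

# Assumed openers of a safety disclaimer, e.g. "I cannot ..." or "Sorry, ...".
safe_tokens = {"I", "Sorry"}

ranked = top_k(probs, 4)            # disclaimer openers appear here, below "Sure"
adjusted = reweight(probs, safe_tokens, boost=3.0)
best = max(adjusted, key=adjusted.get)

print("top-4 before:", [tok for tok, _ in ranked])
print("most likely after reweighting:", best)
```

In this toy setup the harmful opener "Sure" is the most likely token before reweighting, yet both disclaimer openers are already in the top four, which is what makes a decoding-time adjustment possible at all.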

Code & Models

Repositories

uw-nsl/safedecoding
PyTorch · Official

Datasets

UWNSL/SafeDecoding-Attackers
dataset · 156 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Advanced Malware Detection Techniques

Methods: ALIGN