SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran

TL;DR
SafeDecoding is a novel safety-aware decoding method that enhances large language models' resistance to jailbreak attacks by amplifying safety disclaimers and reducing unsafe outputs, thereby improving safety without sacrificing helpfulness.
Contribution
This paper introduces SafeDecoding, a new decoding strategy that effectively defends against jailbreak attacks by leveraging safety disclaimers during token probability adjustments.
Findings
Significantly reduces jailbreak success rate
Outperforms six existing defense methods
Maintains helpfulness of responses to benign queries
Abstract
As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, which aim to provoke unintended and unsafe behaviors from LLMs, remain a significant LLM safety threat. In this paper, we aim to defend LLMs against jailbreak attacks by introducing SafeDecoding, a safety-aware decoding strategy for LLMs to generate helpful and harmless responses to user queries. Our insight in developing SafeDecoding is based on the observation that, even though probabilities of tokens representing harmful contents outweigh those representing harmless responses, safety disclaimers still appear among the top tokens after sorting tokens by probability in descending order. This allows us to mitigate jailbreak attacks by…
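The observation in the abstract, that safety disclaimers still rank among the top tokens even when a harmful continuation is the most likely one, can be illustrated with a toy re-weighting of a next-token distribution. The sketch below is only an illustration of that idea and is not the paper's algorithm: the function name `reweight_top_tokens`, the parameters `k`, `boost`, and `damp`, the example probabilities, and the set of "safety" tokens are all assumptions made up for this example.

```python
def reweight_top_tokens(probs, safety_tokens, k=5, boost=2.0, damp=0.5):
    """Toy re-weighting of a next-token distribution (hypothetical helper).

    probs: dict mapping token -> probability.
    safety_tokens: assumed set of tokens that tend to open a safety disclaimer.
    """
    # Sort tokens by probability in descending order and keep the top k.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Amplify tokens flagged as safety-related, attenuate all the others.
    adjusted = {
        tok: p * (boost if tok in safety_tokens else damp) for tok, p in top
    }
    # Renormalize so the adjusted values form a probability distribution.
    total = sum(adjusted.values())
    return {tok: p / total for tok, p in adjusted.items()}


# Made-up distribution: the compliant opener "Sure" is most likely, but
# refusal-style openers ("I", "Sorry") still appear among the top tokens.
probs = {"Sure": 0.50, "I": 0.20, "Here": 0.15, "Sorry": 0.10, "The": 0.05}
new_probs = reweight_top_tokens(probs, safety_tokens={"I", "Sorry"}, k=4)
best = max(new_probs, key=new_probs.get)
print(best, new_probs)
```

With these made-up numbers, the refusal-style opener overtakes "Sure" after re-weighting. A real decoder would apply this kind of adjustment to model logits at each generation step rather than to a hand-written dictionary.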
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Advanced Malware Detection Techniques
Methods: ALIGN
