Don't Say No: Jailbreaking LLM by Suppressing Refusal
Yukai Zhou, Jian Lou, Zhijie Huang, Zhan Qin, Yibei Yang, Wenjie Wang

TL;DR
This paper introduces DSN (Don't Say No), an optimization-based jailbreak attack on LLMs that works by suppressing refusal responses. It reports higher attack success rates than baseline attacks, along with transferability to unseen datasets and black-box models.
Contribution
The study identifies why the vanilla target loss used by existing optimization-based attacks is suboptimal and proposes DSN, which combines a cosine decay schedule with refusal suppression to improve attack success and universality.
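As a rough illustration of the cosine decay component only, the sketch below generates a generic half-cosine weighting schedule and uses it in a weighted average. The summary does not say what the schedule is applied to, so applying it across sequence positions is an assumption here, and the names `cosine_decay_weights`, `weighted_mean`, and `floor` are hypothetical. This is not the authors' loss and includes no refusal-suppression term.

```python
import math


def cosine_decay_weights(n, floor=0.0):
    """Return n weights that fall from 1.0 to `floor` along a half cosine.

    Assumption: weights are indexed by position (0 = first item).
    """
    if n <= 0:
        return []
    if n == 1:
        return [1.0]
    return [
        floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * i / (n - 1)))
        for i in range(n)
    ]


def weighted_mean(values, weights):
    """Average `values` using `weights`, normalised by the total weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


if __name__ == "__main__":
    # Earlier positions receive larger weights than later ones.
    print(cosine_decay_weights(5))
```

With a schedule like this, earlier entries contribute more to the weighted average than later ones; the `floor` parameter keeps the final weights from reaching zero if that is undesirable.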
Findings
DSN achieves higher success rates than baseline attacks.
DSN demonstrates strong transferability to unseen datasets and black-box models.
The proposed enhancements to the loss objective improve the effectiveness of optimization-based jailbreaking.
Abstract
Ensuring the safety alignment of Large Language Models (LLMs) is critical for generating responses consistent with human values. However, LLMs remain vulnerable to jailbreaking attacks, where carefully crafted prompts manipulate them into producing toxic content. One category of such attacks reformulates the task as an optimization problem, aiming to elicit affirmative responses from the LLM. However, these methods heavily rely on predefined objectionable behaviors, limiting their effectiveness and adaptability to diverse harmful queries. In this study, we first identify why the vanilla target loss is suboptimal and then propose enhancements to the loss objective. We introduce DSN (Don't Say No) attack, which combines a cosine decay schedule method with refusal suppression to achieve higher success rates. Extensive experiments demonstrate that DSN outperforms baseline attacks and…
Taxonomy
Topics: Legal Systems and Judicial Processes · Criminal Law and Evidence · Law, AI, and Intellectual Property
