Don't Say No: Jailbreaking LLM by Suppressing Refusal

Yukai Zhou; Jian Lou; Zhijie Huang; Zhan Qin; Yibei Yang; Wenjie Wang

arXiv:2404.16369 · cs.CL · July 3, 2025 · 1 citation

TL;DR

This paper introduces DSN, a novel attack method that effectively jailbreaks LLMs by suppressing refusal, outperforming existing methods in success rate and transferability across models and datasets.

Contribution

The study identifies limitations of existing attack methods and proposes DSN, a new approach combining cosine decay and refusal suppression to enhance attack success and universality.

Findings

01. DSN achieves higher success rates than baseline attacks.

02. DSN demonstrates strong transferability to unseen datasets and black-box models.

03. The proposed enhancements improve the effectiveness of jailbreaking LLMs.

Abstract

Ensuring the safety alignment of Large Language Models (LLMs) is critical for generating responses consistent with human values. However, LLMs remain vulnerable to jailbreaking attacks, where carefully crafted prompts manipulate them into producing toxic content. One category of such attacks reformulates the task as an optimization problem, aiming to elicit affirmative responses from the LLM. However, these methods heavily rely on predefined objectionable behaviors, limiting their effectiveness and adaptability to diverse harmful queries. In this study, we first identify why the vanilla target loss is suboptimal and then propose enhancements to the loss objective. We introduce the DSN (Don't Say No) attack, which combines a cosine decay schedule method with refusal suppression to achieve higher success rates. Extensive experiments demonstrate that DSN outperforms baseline attacks and…

Code & Models

Repositories

dsn-2024/dsn (PyTorch, official)

Videos

Don't Say No: Jailbreaking LLM by Suppressing Refusal · underline

Taxonomy

Topics: Legal Systems and Judicial Processes · Criminal Law and Evidence · Law, AI, and Intellectual Property