Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

Yanrui Du; Sendong Zhao; Ming Ma; Yuhan Chen; Bing Qin

arXiv:2312.04127 · cs.CL · February 26, 2024 · 2 cites

TL;DR

This paper introduces RADIAL, an automatic jailbreak method that exploits LLMs' inherent tendency to produce affirmation responses in order to elicit harmful outputs, revealing significant security vulnerabilities across multiple languages.

Contribution

The study proposes a novel analysis of LLMs' inherent response tendencies and develops a real-world instructions-driven jailbreak strategy that effectively bypasses safety mechanisms.

Findings

01 High attack success rate on open-source LLMs
02 Effective cross-language attack performance
03 Highlights potential risks of LLMs' inherent response tendencies

Abstract

Extensive work has been devoted to improving the safety mechanisms of Large Language Models (LLMs). However, LLMs still tend to generate harmful responses when faced with malicious instructions, a phenomenon referred to as a "Jailbreak Attack". In our research, we introduce RADIAL, a novel automatic jailbreak method that bypasses the security mechanism by amplifying the potential of LLMs to generate affirmation responses. The idea behind the method is "Inherent Response Tendency Analysis", which identifies real-world instructions that inherently induce LLMs to generate affirmation responses. The corresponding jailbreak strategy is "Real-World Instructions-Driven Jailbreak", which strategically splices the real-world instructions identified through this analysis around the malicious instruction. Our method achieves excellent attack performance on English malicious…

Code & Models

Repositories

dyr1/mogu
pytorch

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques