LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
Shi Lin, Hongming Yang, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han

TL;DR
This paper introduces a novel black-box attack method called Analyzing-based Jailbreak (ABJ) that manipulates the internal reasoning processes of large language and multimodal models to bypass safety measures, exposing significant safety vulnerabilities.
Contribution
The paper identifies an underexplored threat vector, the model's reasoning chain, and proposes a new attack method that achieves high attack success rates and transfers across various models.
Findings
ABJ achieves an 82.1% attack success rate on GPT-4.
ABJ effectively exploits multimodal reasoning capabilities.
The attack demonstrates high transferability and efficiency.
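For readers unfamiliar with the metric, an attack success rate is usually reported as the share of attack attempts judged successful. The sketch below shows that arithmetic only. This summary does not say how the paper judges success, so the boolean per-prompt judgments and the function name `attack_success_rate` are assumptions made for illustration, and the sample data is made up.

```python
def attack_success_rate(outcomes):
    """Return the percentage of attempts that were judged successful.

    `outcomes` is an iterable of booleans, one per attack attempt
    (assumed format; the paper's own judging procedure is not given here).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(1 for o in outcomes if o) / len(outcomes)


# Hypothetical judgments for 8 prompts (True = judged a successful attack).
judgments = [True, True, False, True, True, True, False, True]
rate = attack_success_rate(judgments)
print(f"{rate:.1f}%")
```

A figure such as 82.1% would therefore correspond to roughly 82 successful attempts per 100 evaluated prompts under whatever judging rule the authors apply.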
Abstract
The rapid development of Large Language Models (LLMs) has brought impressive advancements across various tasks. However, despite these achievements, LLMs still pose inherent safety risks, especially in the context of jailbreak attacks. Most existing jailbreak methods follow an input-level manipulation paradigm to bypass safety mechanisms. Yet, as alignment techniques improve, such attacks are becoming increasingly detectable. In this work, we identify an underexplored threat vector: the model's internal reasoning process, which can be manipulated to elicit harmful outputs in a more stealthy way. To explore this overlooked attack surface, we propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ). ABJ comprises two independent attack paths: textual and visual reasoning attacks, which exploit the model's multimodal reasoning capabilities to bypass safety…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Digital and Cyber Forensics
Methods: Autoencoders
