LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models

Shi Lin; Hongming Yang; Rongchang Li; Xun Wang; Changting Lin; Wenpeng Xing; Meng Han

arXiv:2407.16205 · cs.CR · June 19, 2025

TL;DR

This paper introduces a novel black-box attack method called Analyzing-based Jailbreak (ABJ) that manipulates the internal reasoning processes of large language and multimodal models to bypass safety measures, exposing significant safety vulnerabilities.

Contribution

The paper identifies an underexplored threat vector, the model's reasoning chain, and proposes a new attack method that achieves high attack success rates and transfers across various models.

Findings

01 ABJ achieves an 82.1% attack success rate on GPT-4.

02 ABJ effectively exploits multimodal reasoning capabilities.

03 The attack demonstrates high transferability and efficiency.

Abstract

The rapid development of Large Language Models (LLMs) has brought impressive advancements across various tasks. However, despite these achievements, LLMs still pose inherent safety risks, especially in the context of jailbreak attacks. Most existing jailbreak methods follow an input-level manipulation paradigm to bypass safety mechanisms. Yet, as alignment techniques improve, such attacks are becoming increasingly detectable. In this work, we identify an underexplored threat vector: the model's internal reasoning process, which can be manipulated to elicit harmful outputs in a more stealthy way. To explore this overlooked attack surface, we propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ). ABJ comprises two independent attack paths: textual and visual reasoning attacks, which exploit the model's multimodal reasoning capabilities to bypass safety…

Code & Models

Repositories

theshi-1128/ABJ-Attack (Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Digital and Cyber Forensics

Methods: Autoencoders