Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning

Zhaoqi Wang, Zijian Zhang, Daqing He, Pengtao Kou, Xin Li, Jiamou Liu, Jincheng An, and Yong Liu

arXiv:2601.05466·cs.CR·January 12, 2026

TL;DR

This paper introduces iMIST, an adaptive multi-step attack method that disguises malicious prompts as normal tool use and escalates harm through interactive dialogues, exposing vulnerabilities in current LLM safety defenses.

Contribution

The paper presents iMIST, a novel attack technique combining tool-disguised prompts and progressive escalation, revealing weaknesses in existing LLM safety measures.

Findings

01. iMIST achieves higher attack success rates.

02. It maintains low rejection rates during attacks.

03. It reveals critical vulnerabilities in current defenses.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse applications; however, they remain critically vulnerable to jailbreak attacks that elicit harmful responses violating human values and safety guidelines. Despite extensive research on defense mechanisms, existing safeguards prove insufficient against sophisticated adversarial strategies. In this work, we propose iMIST (interactive Multi-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that synergistically exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious queries as normal tool invocations to bypass content filters, while simultaneously introducing an interactive progressive optimization algorithm that dynamically escalates response harmfulness through multi-turn dialogues…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling