Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning
Zhaoqi Wang, Zijian Zhang, Daqing He, Pengtao Kou, Xin Li, Jiamou Liu, Jincheng An, and Yong Liu

TL;DR
This paper introduces iMIST, an adaptive multi-step attack method that disguises malicious prompts as normal tool use and escalates harm through interactive dialogues, exposing vulnerabilities in current LLM safety defenses.
Contribution
The paper presents iMIST, a novel attack technique combining tool-disguised prompts and progressive escalation, revealing weaknesses in existing LLM safety measures.
Findings
iMIST achieves higher attack success rates.
It maintains low rejection rates during attacks.
It reveals critical vulnerabilities in current defenses.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across diverse applications; however, they remain critically vulnerable to jailbreak attacks that elicit harmful responses violating human values and safety guidelines. Despite extensive research on defense mechanisms, existing safeguards prove insufficient against sophisticated adversarial strategies. In this work, we propose iMIST (interactive Multi-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that synergistically exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious queries as normal tool invocations to bypass content filters, while simultaneously introducing an interactive progressive optimization algorithm that dynamically escalates response harmfulness through multi-turn dialogues…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
