HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models
Sidhant Narula, Javad Rafiei Asl, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi

TL;DR
HarmNet is a modular framework that enhances multi-turn jailbreak attacks on large language models by systematically exploring and refining adversarial strategies, achieving significantly higher success rates than existing methods.
Contribution
HarmNet introduces a novel, adaptive, multi-component framework for more effective jailbreak attacks on LLMs, outperforming current state-of-the-art techniques.
Findings
HarmNet achieves a 99.4% attack success rate on Mistral-7B.
On Mistral-7B, HarmNet's attack success rate is 13.9% higher than that of the best baseline.
The framework effectively uncovers stealthy attack paths.
Abstract
Large Language Models (LLMs) remain vulnerable to multi-turn jailbreak attacks. We introduce HarmNet, a modular framework comprising ThoughtNet, a hierarchical semantic network; a feedback-driven Simulator for iterative query refinement; and a Network Traverser for real-time adaptive attack execution. HarmNet systematically explores and refines the adversarial space to uncover stealthy, high-success attack paths. Experiments across closed-source and open-source LLMs show that HarmNet outperforms state-of-the-art methods, achieving higher attack success rates. For example, on Mistral-7B, HarmNet achieves a 99.4% attack success rate, 13.9% higher than the best baseline.
Index terms: jailbreak attacks; large language models; adversarial framework; query refinement.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Graph Neural Networks
