Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models
Songze Li, Ruishi He, Xiaojun Jia, Jun Wang, Zhihui Fu

TL;DR
This paper introduces Mastermind, a dynamic, self-improving multi-turn jailbreak framework for large language models that improves attack success and robustness against defenses through autonomous planning, reflection, and knowledge refinement.
Contribution
It presents a novel hierarchical, knowledge-driven approach for multi-turn jailbreaks that adapts and refines attack strategies through autonomous interaction and reflection.
Findings
Mastermind achieves higher attack success rates than existing methods.
It demonstrates robustness against advanced defense mechanisms.
The framework effectively refines attack knowledge over multiple interactions.
Abstract
Large Language Models (LLMs) face a significant threat from multi-turn jailbreak attacks, where adversaries progressively steer conversations to elicit harmful outputs. However, the practical effectiveness of existing attacks is undermined by several critical limitations: they struggle to maintain a coherent progression over long interactions, often losing track of what has been accomplished and what remains to be done; they rely on rigid or pre-defined patterns, and fail to adapt to the LLM's dynamic and unpredictable conversational state. To address these shortcomings, we introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self-improving approach. Mastermind operates in a closed loop of planning, execution, and reflection, enabling it to autonomously build and refine its knowledge of model vulnerabilities through interaction. It employs a hierarchical…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
