Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models

Songze Li; Ruishi He; Xiaojun Jia; Jun Wang; Zhihui Fu

arXiv:2601.05445·cs.CR·January 12, 2026

Open Access

TL;DR

This paper introduces Mastermind, a dynamic, self-improving multi-turn jailbreak framework for large language models. It improves attack success and robustness against defenses through autonomous planning, reflection, and knowledge refinement.

Contribution

It presents a novel hierarchical, knowledge-driven approach for multi-turn jailbreaks that adapts and refines attack strategies through autonomous interaction and reflection.

Findings

1. Mastermind achieves higher attack success rates than existing methods.

2. It demonstrates robustness against advanced defense mechanisms.

3. The framework effectively refines attack knowledge over multiple interactions.

Abstract

Large Language Models (LLMs) face a significant threat from multi-turn jailbreak attacks, where adversaries progressively steer conversations to elicit harmful outputs. However, the practical effectiveness of existing attacks is undermined by several critical limitations: they struggle to maintain a coherent progression over long interactions, often losing track of what has been accomplished and what remains to be done; they rely on rigid or pre-defined patterns, and fail to adapt to the LLM's dynamic and unpredictable conversational state. To address these shortcomings, we introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self-improving approach. Mastermind operates in a closed loop of planning, execution, and reflection, enabling it to autonomously build and refine its knowledge of model vulnerabilities through interaction. It employs a hierarchical…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection