GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication

Hua Tang; Lingyong Yan; Yukun Zhao; Shuaiqiang Wang; Jizhou Huang; Dawei Yin

arXiv:2506.17881·cs.CL·September 30, 2025

Open Access·3 Reviews

TL;DR

This paper introduces GRAF, a novel multi-turn jailbreaking method for large language models that refines attack strategies globally and fabricates responses to better induce harmful outputs, outperforming existing methods.

Contribution

GRAF is the first approach to globally refine attack trajectories and actively fabricate responses in multi-turn jailbreaking, enhancing effectiveness against state-of-the-art LLMs.

Findings

01

GRAF outperforms existing jailbreaking methods across six LLMs.

02

Global refinement improves attack success rate.

03

Active fabrication increases likelihood of harmful output elicitation.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks. Nevertheless, they still pose notable safety risks due to potential misuse for malicious purposes. Jailbreaking, which seeks to induce models to generate harmful content through single-turn or multi-turn attacks, plays a crucial role in uncovering underlying security vulnerabilities. However, prior methods, including sophisticated multi-turn approaches, often struggle to adapt to the evolving dynamics of dialogue as interactions progress. To address this challenge, we propose GRAF (JailBreaking via Globally Refining and Adaptively Fabricating), a novel multi-turn jailbreaking method that globally refines the attack trajectory at each interaction. In addition, we actively fabricate model responses to suppress safety-related warnings, thereby increasing the…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

- The authors report relatively strong, although not SOTA, attack success rates on HarmBench.

Weaknesses

- There are stronger baselines for this problem. For instance, the strongest baseline I'm aware of is X-teaming (https://arxiv.org/pdf/2504.13203), which reports stronger results than this paper.
- I'm also curious how the authors tuned or adjusted their baselines. I'd expect an algorithm like PAIR to do much better on this problem, particularly as this method seems to be "PAIR but for multi-turn jailbreaking."
- For a similar reason, it's unclear what conceptual insight might be valuable t

Reviewer 02·Rating 6·Confidence 3

Strengths

1. Motivation is clear.
2. The method is straightforward. GRAF adheres to the standard multi-turn pipeline (initial trajectory + attacker-driven refinement) and adds two light-weight mechanisms (global refinement; active fabrication) without auxiliary models or heavy hyperparameter tuning. This simplicity is appealing for reproducibility.
3. The main results are significant. Under GPT-Judge, GRAF outperforms both single-turn and multi-turn baselines across six targets (Table 1). The paper als

Weaknesses

1. Side effects of discarding (qi, ai) pairs are under-analyzed. Section 3.3 asserts that dropping pairs helps downstream acceptance, but the paper does not quantify its frequency or impact on on-topic coherence and end-task success. Removed turns may encode semantic glue that carries the malicious intent; deleting them could derail or inadvertently sanitize the trajectory.
2. Section 5.1 seems off-topic. The representation study (Figure 3) argues that more history turns can shift harmful queri

Reviewer 03·Rating 4·Confidence 4

Strengths

The paper is well-written and easy to follow. The motivation for a multi-turn jailbreak attack is sound and clear. The idea of globally optimizing queries along the trajectory is interesting.

Weaknesses

- My biggest concern lies in the generalizability of this method given how the attack queries are initialized. Even though Sec 5.2 shows results for different initialization methods, the method itself is still built on an existing attack method. This requirement seems too strong because of the dependence on other attack methods.
- In the global refinement, modifying the future sequence of queries does not make sense to me, since the attacker only has access to the previous history of queries. If the method

Taxonomy

Topics: Integrated Circuits and Semiconductor Failure Analysis · Advanced Surface Polishing Techniques · VLSI and Analog Circuit Testing