Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking
Zhida He, Xiaoyu Wen, Han Qi, Ziyuan Zhou, Peng Yu, Xingcheng Xu, Dongrui Liu, Xia Hu, Chaochao Lu, Qiaosheng Zhang

TL;DR
This paper introduces TRACE, a turn-aware credit assignment framework for multi-turn jailbreaking of LLMs, improving attack success and safety-utility balance by addressing the non-uniform contribution of dialogue turns.
Contribution
The paper proposes a turn-level credit assignment method for RL-based multi-turn jailbreaking that improves attack effectiveness and, when reused for defense, the safety-utility balance.
Findings
TRACE achieves about 25% relative improvement in attack success rate.
It improves safety-utility balance when reused for defense.
Extensive experiments validate TRACE's effectiveness, transferability, and efficiency.
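To make the "relative improvement" wording in the findings concrete, here is a minimal arithmetic sketch. The two attack success rates below are hypothetical placeholders chosen only to show the calculation; they are not results from the paper, and the helper name `relative_improvement` is an assumption introduced for illustration.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative gain of `new` over `baseline` as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical attack success rates (NOT taken from the paper).
baseline_asr = 0.40
improved_asr = 0.50

gain = relative_improvement(baseline_asr, improved_asr)
absolute_gain = improved_asr - baseline_asr

# A 10-point absolute gain on a 40% baseline is a 25% relative gain.
print(f"absolute gain: {absolute_gain:.2f}, relative gain: {gain:.0%}")
```

A relative figure depends on the baseline it is measured against, so the same absolute gain yields a larger relative improvement when the baseline is lower.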
Abstract
Deploying LLMs in multi-turn dialogues facilitates jailbreak attacks that distribute harmful intent across seemingly benign turns. Recent training-based multi-turn jailbreak methods learn long-horizon attack strategies from interaction feedback, but often rely on coarse trajectory-level outcome signals that broadcast uniformly to every turn. However, we find that turn-level contributions in multi-turn jailbreaking are non-uniform, phase-dependent, and target-specific. Such coarse outcome supervision induces a credit assignment problem, leading to over-rewarding redundant turns in successful trajectories and under-crediting useful intermediate turns in failed ones. To address this, we propose TRACE, a turn-aware credit assignment framework for reinforcement learning (RL)-based multi-turn jailbreaking. For successful trajectories, TRACE estimates turn-level contributions via…
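The credit assignment problem described in the abstract can be illustrated with a small sketch of the coarse baseline signal only. It assumes a binary trajectory-level outcome (1.0 for success, 0.0 for failure) and uses made-up turn labels; the function name `broadcast_outcome_reward` is hypothetical, and the sketch does not attempt to reproduce how TRACE estimates turn-level contributions.

```python
from typing import List


def broadcast_outcome_reward(num_turns: int, outcome: float) -> List[float]:
    """Give every turn the same trajectory-level outcome signal."""
    return [outcome] * num_turns


# Hypothetical 4-turn trajectories; the labels are illustrative only.
successful_turns = ["setup", "redundant", "pivotal", "follow-up"]
failed_turns = ["setup", "useful-progress", "misstep", "final"]

# Assumed binary outcome: 1.0 for a successful trajectory, 0.0 otherwise.
success_credit = broadcast_outcome_reward(len(successful_turns), 1.0)
failure_credit = broadcast_outcome_reward(len(failed_turns), 0.0)

for label, credit in zip(successful_turns, success_credit):
    print(f"success  {label:16s} credit={credit}")
for label, credit in zip(failed_turns, failure_credit):
    print(f"failure  {label:16s} credit={credit}")
```

Under this uniform scheme the "redundant" turn receives the same credit as the "pivotal" one, and the "useful-progress" turn in the failed trajectory receives none, which is the over-rewarding and under-crediting the abstract refers to.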
