TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

Xiqiao Xiong; Ouxiang Li; Zhuo Liu; Moxin Li; Wentao Shi; Fengbin Zhu; Qifan Wang; Fuli Feng

arXiv:2512.07761·cs.AI·April 22, 2026

TL;DR

TROJail is a multi-turn reinforcement learning method that uses process rewards to raise the success rate of multi-turn jailbreak attacks on large language models, addressing the limitations of turn-level optimization.

Contribution

The paper formulates multi-turn jailbreak attacks as a reinforcement learning problem and introduces process rewards to improve attack effectiveness, reporting better results than existing methods.

Findings

01. Improved attack success rates across multiple models.
02. Effective use of process rewards to guide attack strategies.
03. Enhanced ability to generate targeted harmful content.

Abstract

Large language models have seen widespread adoption, yet they remain vulnerable to multi-turn jailbreak attacks, threatening their safe deployment. This has led to the task of training automated multi-turn attackers to probe model safety vulnerabilities. However, existing approaches typically rely on turn-level optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate this task as a multi-turn reinforcement learning problem, directly optimizing the harmfulness of the final-turn response as the outcome reward. To address the sparse supervision of the outcome reward, we introduce TROJail, which employs two process rewards to evaluate the utility of intermediate prompts and integrates them into advantage estimation. These rewards (1) penalize overly harmful prompts that trigger the model's refusal mechanism, and (2) encourage steering the…
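The abstract describes a sparse outcome reward at the final turn plus per-turn process rewards folded into advantage estimation. The following is a minimal numeric sketch of that general idea only, not the paper's formula: the function names, the weight `beta`, the discount `gamma`, the scalar baseline, and the per-turn process reward values are all hypothetical, and the abstract's second process reward is truncated in the source, so it is not modeled here.

```python
from typing import List


def shaped_returns(
    outcome_reward: float,
    process_rewards: List[float],
    beta: float = 0.5,   # hypothetical weight on process rewards
    gamma: float = 1.0,  # hypothetical discount factor
) -> List[float]:
    """Per-turn return-to-go for one trajectory.

    Every turn receives its weighted process reward; only the final
    turn additionally receives the outcome reward.
    """
    n = len(process_rewards)
    returns = [0.0] * n
    running = 0.0
    # Walk backwards so each turn accumulates everything that follows it.
    for t in reversed(range(n)):
        reward = beta * process_rewards[t]
        if t == n - 1:
            reward += outcome_reward
        running = reward + gamma * running
        returns[t] = running
    return returns


def advantages(returns: List[float], baseline: float) -> List[float]:
    """Subtract a scalar baseline (an assumption, e.g. a group mean)."""
    return [g - baseline for g in returns]


# Placeholder scores for a three-turn trajectory; not data from the paper.
example = shaped_returns(1.0, [0.0, -1.0, 0.5])
```

With only the outcome reward, every turn in a trajectory would share the same return; the per-turn terms are what let the estimate distinguish a useful intermediate turn from a counterproductive one.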

Code & Models

Repositories

xxiqiao/TROJail (GitHub)
