TL;DR
TROJail introduces a multi-turn reinforcement learning approach with process rewards to improve the success of multi-turn jailbreak attacks on large language models, addressing the limitations of turn-level optimization.
Contribution
The paper formulates multi-turn jailbreak attacks as a reinforcement learning problem and introduces process rewards to make attacks more effective; the authors report that it outperforms existing methods.
Findings
Improved attack success rates across multiple models.
Effective use of process rewards to guide attack strategies.
Enhanced ability to generate targeted harmful content.
Abstract
Large language models have seen widespread adoption, yet they remain vulnerable to multi-turn jailbreak attacks, threatening their safe deployment. This has led to the task of training automated multi-turn attackers to probe model safety vulnerabilities. However, existing approaches typically rely on turn-level optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate this task as a multi-turn reinforcement learning problem, directly optimizing the harmfulness of the final-turn response as the outcome reward. To address the sparse supervision of the outcome reward, we introduce TROJail, which employs two process rewards to evaluate the utility of intermediate prompts and integrates them into advantage estimation. These rewards (1) penalize overly harmful prompts that trigger the model's refusal mechanism, and (2) encourage steering the…
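The abstract describes combining a sparse final-turn outcome reward with per-turn process rewards inside advantage estimation, but this excerpt does not give the formula. The sketch below shows one generic way such a combination can be computed, purely as an illustration of reward shaping over numeric inputs. The function names, the `weight` and `gamma` parameters, and the group-mean baseline are all assumptions made here, not TROJail's actual method, and the process rewards are treated as opaque scalars supplied by the caller.

```python
from statistics import mean


def shaped_returns(process_rewards, outcome_reward, weight=0.5, gamma=1.0):
    """Per-turn return for one trajectory (hypothetical shaping scheme).

    Each turn contributes weight * process_reward; the outcome reward is
    added only at the final turn, then returns are accumulated backwards.
    """
    returns = []
    running = 0.0
    last = len(process_rewards) - 1
    for t in range(last, -1, -1):
        reward = weight * process_rewards[t]
        if t == last:
            reward += outcome_reward  # sparse signal arrives only at the end
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def group_advantages(group_returns):
    """Subtract the per-turn mean over a group of equal-length trajectories.

    The group-mean baseline is an assumption of this sketch.
    """
    n_turns = len(group_returns[0])
    baselines = [mean(traj[t] for traj in group_returns) for t in range(n_turns)]
    return [
        [traj[t] - baselines[t] for t in range(n_turns)]
        for traj in group_returns
    ]


# Two three-turn trajectories: the first has a penalized middle turn but a
# successful outcome, the second has neutral turns and no outcome reward.
group = [
    shaped_returns([0.0, -1.0, 0.0], outcome_reward=1.0),
    shaped_returns([0.0, 0.0, 0.0], outcome_reward=0.0),
]
advantages = group_advantages(group)
```

With dense per-turn terms, every turn gets its own credit signal rather than inheriting a single end-of-dialogue score, which is the motivation the abstract gives for adding process rewards.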
