A Systematic Investigation of The RL-Jailbreaker in LLMs
Montaser Mohammedalamen, Kevin Roice, Reginald McLean, Alyssa Lefaivre Škopac

TL;DR
This paper systematically analyzes RL-based jailbreaking of LLMs, revealing that environment formalization, especially dense rewards and longer episodes, is key to successful adversarial attacks.
Contribution
It provides the first detailed decomposition of RL jailbreaking, identifying structural factors that influence attack success and offering insights to improve model defenses.
Findings
RL jailbreaking successfully compromised all targeted models.
Dense rewards and longer episodes are primary drivers of success.
Mechanistic understanding aids in developing more robust defenses.
Abstract
The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardening. Adversarial jailbreaking, the strategic manipulation of models to elicit harmful output, remains a primary threat to safe deployment. While Reinforcement Learning (RL) frames jailbreaking as a multi-step attack through sequential optimization, a mechanistic understanding of why the framework succeeds remains incomplete. To fill this gap, we present the first systematic decomposition of RL jailbreaking. We deconstruct the framework into problem formalization (reward function, action space, episode length) and algorithmic measures (RL algorithm, training data, reward shaping) to identify the structural determinants of adversarial success. Our results reveal that the RL-jailbreaker successfully compromised all targeted models and safeguards.…
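The difference between dense and sparse rewards, and the role of episode length, can be illustrated with a generic toy sketch. This is not the authors' environment: `EnvConfig`, `episode_rewards`, and the per-step scores are hypothetical names and placeholder numbers, chosen only to show how the same episode yields different learning signals under the two reward schemes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical knobs loosely mirroring the 'problem formalization' axes."""
    dense_reward: bool   # True: reward at every step; False: only at the last step
    episode_length: int  # maximum number of steps in one episode


def episode_rewards(step_scores: List[float], cfg: EnvConfig) -> List[float]:
    """Map placeholder per-step scores to the rewards an agent would observe."""
    # Truncate the episode to the configured maximum length.
    scores = step_scores[: cfg.episode_length]
    if not scores:
        return []
    if cfg.dense_reward:
        # Dense: every step carries its own feedback signal.
        return list(scores)
    # Sparse: zero reward everywhere except the final step.
    return [0.0] * (len(scores) - 1) + [scores[-1]]


# Placeholder scores, assumed to come from some external evaluator.
scores = [0.1, 0.3, 0.2, 0.9]
dense = episode_rewards(scores, EnvConfig(dense_reward=True, episode_length=4))
sparse = episode_rewards(scores, EnvConfig(dense_reward=False, episode_length=3))
print(dense)   # one reward per step
print(sparse)  # only the last retained step is rewarded
```

In this toy setting, the dense configuration gives the learner feedback at every step, while the sparse one withholds it until the episode ends, and a shorter episode cuts off later steps entirely.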
