Generalization Limits of Reinforcement Learning Alignment
Haruhi Shida, Koo Imai, and Keigo Kansa

TL;DR
This paper investigates the limitations of reinforcement learning-based alignment in large language models, demonstrating that safety techniques do not fully generalize and proposing compound jailbreaks to expose these vulnerabilities.
Contribution
It introduces compound jailbreak techniques that combine multiple attacks to reveal generalization failures in LLM safety alignment.
Findings
Attack success rate (ASR) rose from 14.3% with individual methods to 71.4% with the combined approach.
Safety training does not generalize as broadly as model capabilities.
The results highlight the need for multifaceted safety evaluations.
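For readers unfamiliar with the metric, attack success rate is the share of attack attempts judged successful. The sketch below is only an illustration: the function name `attack_success_rate` and the outcome lists are hypothetical, and the counts (1 of 7 and 5 of 7) were chosen because they happen to round to the reported percentages, not because the summary states how many prompts were evaluated.

```python
def attack_success_rate(outcomes):
    """Return the percentage of attack attempts judged successful."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)


# Hypothetical outcomes, NOT the paper's data: 7 attempts per condition (assumed).
individual = [True] + [False] * 6    # 1 of 7 attempts succeeds
combined = [True] * 5 + [False] * 2  # 5 of 7 attempts succeed

print(f"individual: {attack_success_rate(individual):.1f}%")  # 14.3%
print(f"combined:   {attack_success_rate(combined):.1f}%")    # 71.4%
```

With samples this small, a single additional success or failure shifts the rate by about 14 percentage points, which is worth keeping in mind when comparing conditions.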
Abstract
The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose "compound jailbreaks" targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques, each individually defended against, to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3% with individual methods to 71.4% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities,…
