Generalization Limits of Reinforcement Learning Alignment

Haruhi Shida, Koo Imai, and Keigo Kansa

arXiv:2604.02652 · cs.LG · April 6, 2026

TL;DR

This paper investigates the limits of reinforcement learning-based alignment in large language models. It shows that safety training does not fully generalize and proposes compound jailbreaks to expose the resulting vulnerabilities.

Contribution

The paper introduces compound jailbreaks, which combine multiple attack techniques, each individually defended against, to reveal generalization failures in LLM safety alignment.

Findings

01

On OpenAI gpt-oss-20b, the attack success rate (ASR) rose from 14.3% with individual methods to 71.4% with the combined approach.

02

Safety training does not generalize as broadly as model capabilities.

03

The results highlight the need for multifaceted safety evaluations.
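
As a rough illustration of the metric in the first finding, the attack success rate can be computed as the share of attack attempts judged successful. The sketch below is a minimal example, not the authors' evaluation code: the function name `attack_success_rate` and the outcome lists are hypothetical. The summary does not state how many attempts were evaluated; 1 of 7 and 5 of 7 are assumed here only because those ratios round to the reported percentages.

```python
def attack_success_rate(outcomes):
    """Return the percentage of attempts marked successful (True)."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    successes = sum(1 for succeeded in outcomes if succeeded)
    return 100.0 * successes / len(outcomes)


# Hypothetical outcomes (assumption): the trial counts are not given in the
# source; 1/7 and 5/7 are used only because they round to 14.3% and 71.4%.
individual = [True] + [False] * 6
compound = [True] * 5 + [False] * 2

print(f"individual: {attack_success_rate(individual):.1f}%")  # individual: 14.3%
print(f"compound:   {attack_success_rate(compound):.1f}%")    # compound:   71.4%
```

With so few attempts, a single outcome shifts the rate by about 14 percentage points, so a comparison like this is best read alongside the raw counts.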

Abstract

The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose "compound jailbreaks" targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques, each individually defended against, to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3% with individual methods to 71.4% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities,…
