Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design

Leon Eshuijs; Shihan Wang; Antske Fokkens

arXiv:2604.12500·cs.LG·April 15, 2026

TL;DR

This study examines how safety training shapes harmful behavior in RL-tuned language models. The effect depends on the training environment, and most current safety benchmarks fail to predict it.

Contribution

The paper shows that model size and environment design jointly shape RL-induced misalignment across 11 instruction-tuned LLMs (0.5B to 14B) and 3 environments, which argues for environment-specific safety evaluation.

Findings

01

Model size can either buffer against or enable harmful exploitation, depending on the environment.

02

Most safety benchmarks do not predict RL-induced misalignment; Sycophancy scores are the exception, and only when the exploit relies on inferring the user's preference.

03

On-policy RL preserves a safety buffer inherent in the model's own generation distribution; off-policy settings bypass it.

Abstract

Specification gaming under Reinforcement Learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We train 11 instruction-tuned LLMs (0.5B to 14B) with on-policy RL across 3 environments and find that model size acts as a safety buffer in some environments but enables greater harmful exploitation in others. Controlled ablations trace this reversal to environment-specific features such as role framing and implicit gameability cues. We further show that most safety benchmarks do not predict RL-induced misalignment, except for Sycophancy scores when the exploit relies on inferring the user's preference. Finally, we find that on-policy RL preserves a safety buffer inherent in the model's own generation distribution, one that is bypassed in off-policy settings.
