Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho; Huan Song; Arijit Ghosh Chowdhury; Haotian An; Yawei Wang; Rohit Thekkanal; Negin Sokhandan; Sharlina Keshava; Hannah Marlowe

arXiv:2511.21050 · cs.LG · November 27, 2025

TL;DR

This paper demonstrates that reinforcement learning with verifiable rewards (RLVR) can improve reasoning in large language models while maintaining safety, challenging the common belief of an unavoidable safety-capability tradeoff.

Contribution

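The paper presents what the authors describe as the first comprehensive theoretical and empirical analysis of safety in RLVR. It shows that RLVR can improve reasoning while maintaining or improving safety, derives upper bounds on safety drift, and validates the results on adversarial safety benchmarks.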

Findings

01. RLVR can improve reasoning capabilities in LLMs.

02. RLVR maintains or enhances safety guardrails in adversarial benchmarks.

03. Theoretical bounds show conditions where safety degradation is eliminated.

Abstract

Fine-tuning large language models (LLMs) for downstream tasks typically exhibits a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks,…
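The paper's own bounds are not reproduced in this excerpt. As a rough illustration of why a KL constraint can limit safety drift at all, the sketch below uses the standard Pinsker inequality, which is an assumption swapped in here and not the authors' derivation: for any score bounded in [0, 1], the change in its expectation between two distributions is at most sqrt(KL / 2). The three-response policies and the `unsafe_score` vector are hypothetical toy values.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_score(dist, scores):
    """Expectation of a per-response score under a discrete distribution."""
    return sum(d * s for d, s in zip(dist, scores))


def pinsker_drift_bound(kl):
    """Upper bound on |E_p[s] - E_q[s]| for any score s in [0, 1] (Pinsker)."""
    return math.sqrt(kl / 2.0)


# Hypothetical toy setup: three candidate responses to one prompt.
# The first two are safe; the third is unsafe (score 1.0).
reference_policy = [0.70, 0.25, 0.05]  # assumed aligned base model
tuned_policy = [0.60, 0.34, 0.06]      # assumed policy after RL fine-tuning
unsafe_score = [0.0, 0.0, 1.0]

kl = kl_divergence(tuned_policy, reference_policy)
drift = abs(
    expected_score(tuned_policy, unsafe_score)
    - expected_score(reference_policy, unsafe_score)
)
bound = pinsker_drift_bound(kl)

print(f"KL(tuned || reference) = {kl:.4f} nats")
print(f"observed drift in unsafe rate = {drift:.4f}")
print(f"Pinsker upper bound on drift = {bound:.4f}")
```

In this toy case most of the shifted probability mass moves between the two safe responses, so the observed drift sits well below the bound. This generic bound cannot reach zero for a nonzero KL budget, so the abstract's conditions for eliminating degradation must rest on further assumptions that this excerpt does not show.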

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)