SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints

Dominik Wagner; Ankit Kanwar; Luke Ong

arXiv:2512.23770·cs.LG·May 11, 2026

TL;DR

SB-TRPO is a reinforcement learning algorithm for strict, zero-cost safety constraints: it aims for near-zero safety violations while still improving task reward, without becoming overly conservative.

Contribution

Introduces SB-TRPO, a principled method for hard-constrained RL that dynamically balances cost reduction with reward improvement and comes with formal guarantees of local progress on safety.

Findings

01

SB-TRPO achieves the best safety-performance trade-off in Safety Gymnasium tasks.

02

The method guarantees local progress on safety while still improving reward whenever the gradients are suitably aligned.

03

Experiments show a consistent balance between safety and task performance.

Abstract

In safety-critical domains, reinforcement learning (RL) agents must often satisfy strict, zero-cost safety constraints while accomplishing tasks. Existing model-free methods frequently either fail to achieve near-zero safety violations or become overly conservative. We introduce Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a principled algorithm for hard-constrained RL that dynamically balances cost reduction with reward improvement. At each step, SB-TRPO updates via a dynamic convex combination of the reward and cost natural policy gradients, ensuring a fixed fraction of optimal cost reduction while using remaining update capacity for reward improvement. Our method comes with formal guarantees of local progress on safety, while still improving reward whenever gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks demonstrate that…
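The update rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is one plausible reading of "a dynamic convex combination of the reward and cost natural policy gradients" under a linearised model with a quadratic trust region. The function name `safety_biased_step`, the parameters `delta` (trust-region radius) and `beta` (required fraction of the best achievable cost reduction), and the rule for choosing the mixing weight `mu` are all assumptions made for illustration.

```python
import numpy as np

def safety_biased_step(g_r, g_c, fisher, delta=0.01, beta=0.5):
    """Hypothetical sketch of a safety-biased trust-region step.

    g_r, g_c : reward and cost policy gradients (1-D arrays)
    fisher   : positive-definite Fisher information matrix
    delta    : trust-region radius, step' F step <= 2 * delta (assumed form)
    beta     : fraction in (0, 1] of the best linearised cost reduction
               that the step must achieve (assumed parameter)
    """
    # Natural gradient directions: ascent on reward, descent on cost.
    d_r = np.linalg.solve(fisher, g_r)
    d_c = -np.linalg.solve(fisher, g_c)

    def to_boundary(d):
        # Scale a direction so that it lies on the trust-region boundary.
        q = d @ fisher @ d
        return d * np.sqrt(2.0 * delta / q) if q > 1e-12 else np.zeros_like(d)

    s_r, s_c = to_boundary(d_r), to_boundary(d_c)

    # Linearised cost reduction of the pure reward step (a) and of the
    # pure cost step (b); b is the best reduction inside the trust region.
    a = -g_c @ s_r
    b = -g_c @ s_c
    target = beta * b

    # Smallest weight on the cost step that still meets the target.
    # The reduction is linear in mu, so this has a closed form.
    mu = 0.0 if a >= target else (target - a) / (b - a)

    # A convex combination of two points in the (convex) trust region
    # stays inside the trust region.
    return (1.0 - mu) * s_r + mu * s_c, mu
```

In this sketch, when the reward step already reduces cost enough (the gradients are aligned), `mu` is 0 and the full update capacity goes to reward. When reward and cost gradients conflict directly, for example `g_r = g_c = [1, 0]` with an identity Fisher matrix, `delta = 0.5` and `beta = 0.5`, the weight comes out as `mu = 0.75` and the step is `[-0.5, 0]`, which achieves exactly half of the best linearised cost reduction.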
