Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

Guopeng Li; Matthijs T.J. Spaan; Julian F.P. Kooij

arXiv:2603.23889 · cs.LG · March 26, 2026

Open Access

TL;DR

This paper introduces COX-Q, an off-policy safe reinforcement learning algorithm that combines cost-aware exploration and distributional value estimation to improve safety and efficiency in safety-critical tasks.

Contribution

The paper proposes a novel off-policy safe RL method, COX-Q, integrating cost-bounded exploration and distributional critics for better safety and sample efficiency.

Findings

01 · COX-Q achieves high sample efficiency in safety-critical tasks.

02 · COX-Q maintains competitive safety performance during testing.

03 · The method effectively controls data collection costs.

Abstract

When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value…
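The abstract names truncated quantile critics without giving details in this excerpt. As a rough illustration of the general idea only, the sketch below pools quantile estimates from several critics, sorts them, and drops a fixed number of atoms from one tail before averaging. The function name `truncated_quantile_mean`, its parameters, and the choice of which tail to drop are assumptions made for illustration; the excerpt does not state how COX-Q configures the truncation.

```python
from typing import Sequence


def truncated_quantile_mean(
    critic_quantiles: Sequence[Sequence[float]],
    drop_per_critic: int,
    drop_tail: str = "top",
) -> float:
    """Average the pooled quantile atoms after discarding one tail.

    critic_quantiles: one list of quantile estimates per critic
        (hypothetical inputs; a real agent would get these from networks).
    drop_per_critic: atoms to discard per critic (assumed hyperparameter).
    drop_tail: "top" discards the largest atoms, "bottom" the smallest.
    """
    if drop_tail not in ("top", "bottom"):
        raise ValueError("drop_tail must be 'top' or 'bottom'")

    # Pool every atom from every critic into one sorted list.
    pooled = sorted(q for critic in critic_quantiles for q in critic)

    n_drop = drop_per_critic * len(critic_quantiles)
    if not 0 <= n_drop < len(pooled):
        raise ValueError("must keep at least one atom")

    if drop_tail == "top":
        kept = pooled[: len(pooled) - n_drop]
    else:
        kept = pooled[n_drop:]

    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Two hypothetical critics with four quantile atoms each.
    critics = [[1.0, 2.0, 3.0, 10.0], [0.0, 2.0, 4.0, 12.0]]
    print(truncated_quantile_mean(critics, drop_per_critic=1, drop_tail="top"))
    print(truncated_quantile_mean(critics, drop_per_critic=1, drop_tail="bottom"))
```

Dropping the largest atoms lowers the estimate and dropping the smallest raises it, so the tail chosen determines whether the truncation guards against overestimation or underestimation. The tail is left as a parameter here because the excerpt does not say which direction the cost critic uses.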

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety