Safe-Support Q-Learning: Learning without Unsafe Exploration

Yeeun Lim; Narim Jeong; Donghwan Lee

arXiv:2604.25379·cs.LG·April 29, 2026

TL;DR

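This paper introduces a safe reinforcement learning framework that avoids visiting unsafe states during training. It collects data with a behavior policy supported on a safe set, under the assumption that the induced trajectories remain within that set, and trains the Q-function with a KL-regularized Bellman target.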

Contribution

The paper proposes a Q-learning-based safe RL method with a two-stage design in which the Q-function and the policy are trained separately. The method keeps training inside the safe set while still allowing sufficient exploration within it.

Findings

01 Achieves stable learning and well-calibrated value estimates.

02 Yields safer behavior with comparable or better performance than baselines.

03 Supports different action spaces and behavior policies.

Abstract

Ensuring safety during reinforcement learning (RL) training is critical in real-world applications where unsafe exploration can lead to devastating outcomes. While most safe RL methods mitigate risk through constraints or penalization, they still allow exploration of unsafe states during training. In this work, we adopt a stricter safety requirement that eliminates unsafe state visitation during training. To achieve this goal, we propose a Q-learning-based safe RL framework that leverages a behavior policy supported on a safe set. Under the assumption that the induced trajectories remain within the safe set, this policy enables sufficient exploration within the safe region without requiring near-optimality. We adopt a two-stage framework in which the Q-function and policy are trained separately. Specifically, we introduce a KL-regularized Bellman target that constrains the Q-function to…
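To make the idea of a KL-regularized Bellman target concrete, the sketch below uses the textbook closed form for a backup regularized toward a behavior policy mu: V(s') = tau * log sum_a mu(a|s') * exp(Q(s',a) / tau). This is an assumption for illustration only. The abstract above is truncated, so the exact target used in the paper may differ, and the function names, the temperature `tau`, and the discrete action setting are all hypothetical. The point it shows is that actions outside the support of mu carry zero weight, so the target never bootstraps from them.

```python
import math


def kl_regularized_value(q_values, behavior_probs, tau):
    """Generic KL-regularized state value (assumed form, not the paper's):
    tau * log sum_a mu(a) * exp(Q(a) / tau), taken over actions with mu(a) > 0.
    """
    # Only actions in the support of the behavior policy contribute.
    supported = [(q, p) for q, p in zip(q_values, behavior_probs) if p > 0.0]
    if not supported:
        raise ValueError("behavior policy has empty support")
    # Subtract the largest supported Q-value for numerical stability.
    q_max = max(q for q, _ in supported)
    total = sum(p * math.exp((q - q_max) / tau) for q, p in supported)
    return q_max + tau * math.log(total)


def bellman_target(reward, gamma, next_q_values, behavior_probs, tau, done=False):
    """One-step target r + gamma * V(s'), with V from kl_regularized_value."""
    if done:
        return reward
    return reward + gamma * kl_regularized_value(next_q_values, behavior_probs, tau)


# Hypothetical next-state Q-values; the third action is outside the safe support.
next_q = [1.0, 1.0, 100.0]
mu = [0.5, 0.5, 0.0]
target = bellman_target(reward=1.0, gamma=0.9, next_q_values=next_q,
                        behavior_probs=mu, tau=1.0)
print(target)
```

In this form, a small `tau` pushes the value toward the maximum over supported actions, while a large `tau` pushes it toward the expectation under mu. The large Q-value on the unsupported action has no effect on the target in either case.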
