Guardian: Decoupling Exploration from Safety in Reinforcement Learning

Kaitong Cai; Jusheng Zhang; Jing Yang; Keze Wang

arXiv:2510.22859 · cs.LG · October 28, 2025

TL;DR

This paper introduces RLPD-GX, a framework that separates exploration from safety in hybrid offline-online reinforcement learning, leading to more stable training, better exploration, and state-of-the-art results on Atari-100k.

Contribution

We propose a decoupled safety enforcement framework that stabilizes hybrid RL by separating exploration from safety, with theoretical convergence guarantees and empirical state-of-the-art performance.

Findings

1. Achieved a normalized mean score of 3.02 on Atari-100k, a 45% improvement over prior methods.

2. Demonstrated stable training with safety guarantees across tasks.

3. Validated the generality of decoupled safety in various RL settings.

Abstract

Hybrid offline-online reinforcement learning (O2O RL) promises both sample efficiency and robust exploration, but suffers from instability due to distribution shift between offline and online data. We introduce RLPD-GX, a framework that decouples policy optimization from safety enforcement: a reward-seeking learner explores freely, while a projection-based guardian guarantees rule-consistent execution and safe value backups. This design preserves the exploratory value of online interactions without collapsing to conservative policies. To further stabilize training, we propose dynamic curricula that gradually extend temporal horizons and anneal offline-online data mixing. We prove convergence via a contraction property of the guarded Bellman operator, and empirically show state-of-the-art performance on Atari-100k, achieving a normalized mean score of 3.02 (+45% over prior hybrid…
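To make the idea of "safe value backups" concrete, the following is a minimal tabular sketch and not the authors' implementation. It assumes a known transition tensor `P`, reward matrix `R`, and a boolean mask `safe` standing in for the safety predicate; all of these names, and the toy two-state MDP, are hypothetical. The backup takes the maximum only over actions marked safe, and `guard_action` replaces an unsafe proposal with the best safe action, which is an assumed stand-in for the paper's projection step.

```python
import numpy as np

def guarded_value_iteration(P, R, safe, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Value iteration whose max is restricted to safe actions.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    safe: (S, A) boolean mask (hypothetical encoding of the safety predicate).
    """
    if not safe.any(axis=1).all():
        raise ValueError("every state needs at least one safe action")
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * (P @ V)               # one-step lookahead, shape (S, A)
        V_new = np.where(safe, Q, -np.inf).max(axis=1)  # back up over safe set only
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q_safe = np.where(safe, R + gamma * (P @ V), -np.inf)
    return V, Q_safe

def guard_action(proposed, q_safe_row):
    """Keep a safe proposal; otherwise fall back to the best safe action.

    Assumption: this fallback replaces the paper's projection, whose
    exact form is not given in the text above.
    """
    if np.isfinite(q_safe_row[proposed]):
        return proposed
    return int(np.argmax(q_safe_row))

# Toy MDP: 2 states, 2 actions, deterministic transitions.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 5.0], [1.0, 0.0]])
safe = np.array([[True, False], [True, True]])  # action 1 is unsafe in state 0

V, Q_safe = guarded_value_iteration(P, R, safe, gamma=0.5)
print(V, guard_action(1, Q_safe[0]))
```

In this toy problem the high-reward action in state 0 is masked out, so the learner's proposal of action 1 there is overridden, while the same proposal in state 1 passes through unchanged. Because the masked maximum is still a maximum over a non-empty set, the usual discounted contraction argument applies to this sketch.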

Peer Reviews

Decision: ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- Clean modularization of safety vs. optimization. The Guardian executes projected actions and the critic backs up only over the safe set; this neatly avoids gradient conflicts from penalty-based CMDP formulations. (See Eq. (6), Eq. (9)–(10), and Algorithm 1.)
- Provable convergence under the guarded operator. The contraction of $T_\Pi$ (Def. 1/Theorem 1) ensures value-iteration stability when restricting to $A_{\text{safe}}(s)$. While standard, it formalizes the setting.
- Empirical coverage and

Weaknesses

- Assumed safety oracle & projection practicality. The method presumes a binary predicate $g(s,a)$ and non-empty safe sets for all states (Appendix B assumptions). For Atari this may be hand-engineered; for general continuous domains it is unclear how to construct $A_{\text{safe}}(s)$ or compute the projection efficiently/accurately (e.g., discrete action spaces with an $L_2$ projection). Please clarify how the “Safety Mapping Matrix” in Fig. 1 is built and scaled.
- Theory is mostly standard. The co

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- RLPD-GX has the best overall score on the Atari 100k benchmark, but this result is invalid as it is the product of a flawed setup.

Weaknesses

- The paper’s main comparison is flawed. It applies a Safety component (the Guardian) to Atari-100k, a benchmark without safety constraints, and compares against baselines that aren’t Safe RL methods. The results report only scores, not safety metrics, so they don’t measure safety at all. This makes the evaluation unfair: adding rules that prevent bad actions, then claiming higher returns than unconstrained baselines.
- The Guardian $g(s,a)$ is under-specified: the paper does not clearly state the

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper cleanly separates reward-seeking action selection from safety enforcement at both execution and backup time. This removes conflicting gradients between reward and safety and is easy to implement on top of standard actor-critic code.
- The guarded max operator proof is standard but correct and communicates the intended fixed point well for the constrained problem.
- The table reports broad gains over offline, online, and hybrid baselines, and ablations indicate the Guardian is the do

Weaknesses

- The paper assumes an external binary safety predicate $g(s,a)$ for each state-action pair, with examples like not jumping toward monsters in Q*bert, but it does not specify how these rules are generated or verified at scale on Atari. It is unclear if $g$ is learned, scripted, or uses privileged features. The feasibility and cost of such a predicate are central to the claim of generality.
- The Guardian uses the nearest projection in action space. In Atari, the action set is discrete and unordered

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. The experiments and baselines for comparison are extensive.
2. The writing of the paper is clear.

Weaknesses

1. The action projection relies on the safe action set, which is defined by a predicate $g$. However, how to obtain this predicate is never introduced. Why there is always a permissible action in a state, as indicated by Eq. (3), is also not explained.
2. The action projection method is closely related to safety filter methods in safe RL; see [1] for a comprehensive review. Also, the theory of guarded value iteration is closely related to feasible policy iteration proposed by [2]. Novelty of this
