Safe Exploration via Policy Priors

Manuel Wendl; Yarden As; Manish Prajapat; Anton Pollak; Stelian Coros; Andreas Krause

arXiv:2601.19612·cs.LG·February 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces SOOPER, a safe reinforcement learning method that leverages conservative policy priors and probabilistic models to ensure safety and convergence during online learning, validated through experiments and theoretical analysis.

Contribution

The paper presents SOOPER, a novel safe RL algorithm that guarantees safety and convergence using conservative priors and probabilistic models, with proven theoretical bounds and empirical validation.

Findings

01

SOOPER guarantees safety during learning.

02

It outperforms state-of-the-art safe RL methods.

03

Experimental results validate theoretical guarantees.

Abstract

Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g., simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, and they validate our theoretical guarantees in practice.
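The "explore optimistically, fall back pessimistically" idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the ensemble of dynamics functions standing in for a probabilistic model, the fixed rollout horizon, and all names (`pessimistic_cost`, `select_action`, `budget`) are assumptions made for illustration only.

```python
def pessimistic_cost(state, action, models, cost_fn, prior_policy, horizon):
    """Worst-case cumulative cost over all models in the ensemble.

    The candidate action is applied first; afterwards the conservative
    prior policy is followed for the rest of the rollout.
    """
    totals = []
    for f in models:
        s, a, total = state, action, 0.0
        for _ in range(horizon):
            total += cost_fn(s)   # cost incurred in the current state
            s = f(s, a)           # this model's predicted next state
            a = prior_policy(s)   # hand control to the prior
        totals.append(total)
    return max(totals)


def select_action(state, explore_policy, prior_policy, models, cost_fn,
                  budget, horizon=10):
    """Return (action, used_fallback).

    The exploratory action is kept only if even the most pessimistic
    model predicts that the cost budget is respected.
    """
    candidate = explore_policy(state)
    worst = pessimistic_cost(state, candidate, models, cost_fn,
                             prior_policy, horizon)
    if worst <= budget:
        return candidate, False
    return prior_policy(state), True


# Toy 1-D system (hypothetical): two plausible models with different gains.
models = [lambda s, a: s + 0.9 * a, lambda s, a: s + 1.1 * a]
cost_fn = lambda s: 1.0 if abs(s) > 1.0 else 0.0   # unsafe outside [-1, 1]
prior_policy = lambda s: -0.5 * s                  # conservative: steer to 0

bold = select_action(0.0, lambda s: 1.0, prior_policy, models, cost_fn, 0.0)
mild = select_action(0.0, lambda s: 0.5, prior_policy, models, cost_fn, 0.0)
```

In this toy setup the bold action is rejected because one plausible model predicts leaving the safe interval, while the milder action is accepted under both models.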

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

+ Strong analytical results by combining always-safe learning with sublinear cumulative regret.
+ The pessimistic-termination MDP approach converts a constrained problem into a standard RL one.
+ Unification of optimism, pessimism, and expansion through a single intrinsic-reward objective.
+ Empirically validated on diverse continuous-control tasks and real hardware, not only simulators.
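As a rough illustration of the second and third points, the sketch below wraps a learned model in a step function that adds an uncertainty bonus to the reward and terminates the episode once a pessimistic cost estimate exceeds the budget. It is a sketch under stated assumptions, not the paper's construction: ensemble disagreement is used as a stand-in for model uncertainty, and the names `augmented_step`, `bonus_scale`, and `pessimism` are hypothetical.

```python
from statistics import pstdev


def ensemble_step(state, action, models):
    """Mean prediction and disagreement (population std) of an ensemble."""
    preds = [f(state, action) for f in models]
    mean = sum(preds) / len(preds)
    return mean, pstdev(preds)


def augmented_step(state, action, models, reward_fn, cost_fn,
                   spent, budget, bonus_scale=1.0, pessimism=1.0):
    """One step of an illustrative 'pessimistic-termination' model MDP.

    Optimism: the reward gets an intrinsic bonus proportional to uncertainty.
    Pessimism: the accumulated cost is inflated by the same uncertainty, and
    the episode terminates as soon as the inflated cost exceeds the budget,
    so a standard (unconstrained) RL solver can be run on the wrapped model.
    """
    next_state, sigma = ensemble_step(state, action, models)
    reward = reward_fn(state, action) + bonus_scale * sigma
    spent_next = spent + cost_fn(state, action) + pessimism * sigma
    terminated = spent_next > budget
    return next_state, reward, spent_next, terminated


# Toy 1-D example (hypothetical models and reward).
models = [lambda s, a: s + 0.9 * a, lambda s, a: s + 1.1 * a]
reward_fn = lambda s, a: -abs(s)
cost_fn = lambda s, a: 0.0

tight = augmented_step(0.0, 1.0, models, reward_fn, cost_fn,
                       spent=0.0, budget=0.05)
loose = augmented_step(0.0, 1.0, models, reward_fn, cost_fn,
                       spent=0.0, budget=1.0)
```

With the tight budget the uncertainty alone triggers termination, whereas the loose budget lets the rollout continue and collect the exploration bonus.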

Weaknesses

- The proof seems to implicitly assume Lipschitz continuity of the uncertainty estimate $\sigma_n$, which is not formally stated. I think this assumption is required for several bounds.
- Assumption 4 requires the prior policy to be at least as safe as the optimal one for all states, which seems very strong.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. While the combination of offline training and online exploration is not entirely new, the paper demonstrates novelty through its theoretical development, particularly in providing safety and optimality guarantees.
2. The empirical evaluations are comprehensive and effectively support the theoretical findings, illustrating the method’s applicability to practical scenarios.
3. The paper is well written and easy to follow, even in the theoretical sections, which are presented with clarity an

Weaknesses

1. In the Introduction, the authors state that their theoretical results hold under regularity assumptions, and in the “Optimality” subsection of Related Works, they claim to relax some assumptions from prior studies. However, in Section 3 (Problem Setting), Assumption 1 regarding Gaussian noise appears rather restrictive. Moreover, the assumption that the transition dynamics follow $s_{t+1} = f(s_t, a_t) + \omega_t$ is quite strong; a more general formulation such as \( s_{t+1} = f(s_t, a_t

Reviewer 03 · Rating 8 · Confidence 5

Strengths

1. This paper is easy to follow and the motivations behind the proposed algorithm are clearly presented. The paper provides a well-motivated discussion of the need for safe exploration in reinforcement learning, emphasizing the trade-off between conservatism and exploration.
2. The relevant literature has been well-covered and I do not find missing references.
3. The proposed algorithm is technically sound.
4. Theoretical results are rigorous, providing both safety guarantees and a sublinear

Weaknesses

1. The theoretical analysis relies on several strong assumptions (e.g., Gaussian noise, Lipschitz continuity, bounded RKHS norms, and the existence of a pessimistic policy prior that satisfies safety for all plausible dynamics). While such assumptions are sometimes seen in other safe RL literature, they may limit the practical applicability of the theoretical results.
2. Although the paper claims that SOOPER can be implemented on top of standard model-based methods, the overall system involves

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning