Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Prajwal Koirala; Zhanhong Jiang; Soumik Sarkar; Cody Fleming

arXiv:2412.08794·cs.LG·February 11, 2026

TL;DR

This paper introduces a novel offline RL method that learns safety-constrained policies using latent safety modeling and reward optimization, ensuring safety and high performance in complex scenarios.

Contribution

It proposes a latent safety-constrained policy framework using Conditional Variational Autoencoders and reward-Advantage Weighted Regression, with theoretical guarantees and superior empirical results.

Findings

01 Outperforms existing methods in safety and reward metrics

02 Maintains safety compliance in autonomous driving scenarios

03 Provides theoretical bounds on policy performance

Abstract

In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints, utilizing only offline data. Traditional methods often face difficulties in balancing these constraints, leading to either diminished performance or increased safety risks. We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders, which model the latent safety constraints. Subsequently, we frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression objective within the latent constraint space. Our methodology is supported by theoretical…
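As a rough illustration of the advantage-weighted regression idea mentioned in the abstract, the sketch below computes generic AWR-style weights, exp(A / temperature) with a cap, and uses them in a weighted squared-error loss. It is not the paper's objective: the function names, the `temperature` and `max_weight` parameters, and the use of squared error (which corresponds to a fixed-variance Gaussian likelihood) are all assumptions made for illustration, and the latent-space encoder described in the abstract is not modeled here.

```python
import math


def awr_weights(advantages, temperature=1.0, max_weight=20.0):
    """Generic AWR-style weights: min(exp(A / temperature), max_weight).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    log_cap = math.log(max_weight)
    weights = []
    for advantage in advantages:
        z = advantage / temperature
        # Compare in log space so math.exp is never called on a huge value.
        weights.append(max_weight if z > log_cap else math.exp(z))
    return weights


def weighted_regression_loss(targets, predictions, weights):
    """Weighted mean squared error between dataset targets and predictions."""
    total = sum(w * (t - p) ** 2 for t, p, w in zip(targets, predictions, weights))
    return total / len(weights)


if __name__ == "__main__":
    # Made-up advantages: samples with higher advantage get larger weights.
    advantages = [-1.0, 0.0, 2.0]
    weights = awr_weights(advantages, temperature=1.0)
    print(weights)
    print(weighted_regression_loss([1.0, 0.5, 0.0], [0.8, 0.5, 0.3], weights))
```

The cap on the weights is a common stabilization choice in advantage-weighted methods; a smaller `temperature` concentrates the regression on the highest-advantage samples.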

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The authors provide a theoretical analysis of their method.
- The proposed method outperforms baselines in safe RL tasks.

Weaknesses

- The proposed method cannot strictly enforce the safety constraints. A state-action pair is safe iff $ V_c^\pi(s)\leq \kappa $, but Equation 10 only serves as a regularizer that penalizes high-cost actions and cannot guarantee $ V_c^\pi(s)\leq \kappa $.
- Theoretical insight is one of the contributions of this work. However, the authors do not derive an exact sample complexity; they only claim that the sample complexity for obtaining an $ \epsilon $-optimal policy is about $ \mathcal{O}(1/

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Originality: The LSPC framework offers a fresh perspective on managing the trade-off between safety and rewards, particularly within the context of offline reinforcement learning, demonstrating significant practical and theoretical value.
Significance: The paper provides a detailed and persuasive theoretical analysis of policy performance, ensuring the scientific foundation of the proposed method.
Quality: Through systematic experimentation across various benchmark tasks, the effectiveness

Weaknesses

Please refer to questions

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Balance of safety and performance: Designed to optimize rewards while satisfying safety constraints.
2. Efficiency of offline learning: Learns policies without further interactions with the environment by using fixed datasets.
3. Theoretical stability guarantee: Offers theoretical bounds on policy performance and sample complexity.
4. Applicability in various environments: Demonstrates strong performance in complex tasks, including autonomous driving and robotic manipulation.

Weaknesses

1. Difficulty in latent space configuration: Finding the optimal latent space configuration requires empirical tuning.
2. Dependency on dataset quality: Safe policy learning heavily relies on the quality of the dataset.
3. Limitations in real-time application: Offline RL struggles to respond immediately to dynamic changes in real-time environments.

Code & Models

Repositories

PrajwalKoirala/LSPC-Safe-Offline-RL
PyTorch · Official

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Software Reliability and Analysis Research · Reinforcement Learning in Robotics