Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning
Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, Cody Fleming

TL;DR
This paper introduces a novel offline RL method that learns safety-constrained policies using latent safety modeling and reward optimization, ensuring safety and high performance in complex scenarios.
Contribution
It proposes a latent safety-constrained policy framework using Conditional Variational Autoencoders and reward-Advantage Weighted Regression, with theoretical guarantees and superior empirical results.
Findings
Outperforms existing methods in safety and reward metrics
Maintains safety compliance in autonomous driving scenarios
Provides theoretical bounds on policy performance
Abstract
In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints, utilizing only offline data. Traditional methods often face difficulties in balancing these constraints, leading to either diminished performance or increased safety risks. We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders, which model the latent safety constraints. Subsequently, we frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression objective within the latent constraint space. Our methodology is supported by theoretical…
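The abstract mentions a reward-Advantage Weighted Regression objective without spelling it out. As a rough illustration only, the sketch below shows the generic advantage-weighted regression weighting (exponentiated advantages applied to a log-likelihood loss), not the paper's latent-space objective. The function names `awr_weights` and `awr_loss`, the `temperature` parameter, and the `max_weight` clipping value are all assumptions introduced for this example.

```python
import numpy as np


def awr_weights(advantages, temperature=1.0, max_weight=20.0):
    """Generic AWR weights: exp(A / temperature), clipped for numerical stability.

    temperature and max_weight are illustrative assumptions, not values from the paper.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    weights = np.exp(advantages / temperature)
    return np.minimum(weights, max_weight)


def awr_loss(log_probs, advantages, temperature=1.0, max_weight=20.0):
    """Negative advantage-weighted log-likelihood: -mean(w * log pi(a|s))."""
    weights = awr_weights(advantages, temperature, max_weight)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    return -np.mean(weights * log_probs)


if __name__ == "__main__":
    # Hypothetical reward advantages and policy log-probabilities for three samples.
    adv = [-1.0, 0.0, 1.0]
    logp = [-0.5, -1.0, -2.0]
    print(awr_weights(adv))
    print(awr_loss(logp, adv))
```

Samples with higher reward advantage receive larger weights, so the regression pulls the policy toward them. In the paper's setting this weighting is reportedly applied when training an encoder within the learned latent constraint space, which this sketch does not model.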
Peer Reviews
Decision: ICLR 2025 Poster
- The authors provide a theoretical analysis of their method.
- The proposed method outperforms baselines in safe RL tasks.
- The proposed method cannot strictly enforce the safety constraints. A state-action pair is safe iff $V_c^\pi(s) \leq \kappa$, but Equation 10 only serves as a regularizer that penalizes high-cost actions and cannot guarantee $V_c^\pi(s) \leq \kappa$.
- Theoretical insight is one of the contributions of this work. However, the authors do not derive an exact sample complexity; they only claim that the sample complexity for obtaining an $\epsilon$-optimal policy is about $ \mathcal{O}(1/
Originality: The LSPC framework offers a fresh perspective on managing the trade-off between safety and rewards, particularly within the context of offline reinforcement learning, demonstrating significant practical and theoretical value.
Significance: The paper provides a detailed and persuasive theoretical analysis of policy performance, ensuring the scientific foundation of the proposed method.
Quality: Through systematic experimentation across various benchmark tasks, the effectiveness
Please refer to questions
1. Balance of safety and performance: Designed to optimize rewards while satisfying safety constraints.
2. Efficiency of offline learning: Learns policies without further interactions with the environment by using fixed datasets.
3. Theoretical stability guarantee: Offers theoretical bounds on policy performance and sample complexity.
4. Applicability in various environments: Demonstrates strong performance in complex tasks, including autonomous driving and robotic manipulation.
1. Difficulty in latent space configuration: Finding the optimal latent space configuration requires empirical tuning.
2. Dependency on dataset quality: Safe policy learning heavily relies on the quality of the dataset.
3. Limitations in real-time application: Offline RL struggles to respond immediately to dynamic changes in real-time environments.
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Software Reliability and Analysis Research · Reinforcement Learning in Robotics
