Safe Policy Improvement with Soft Baseline Bootstrapping
Kimia Nadjahi, Romain Laroche, Rémi Tachet des Combes

TL;DR
This paper introduces a less conservative safe policy improvement method for batch reinforcement learning: the trained policy may deviate from the baseline by an amount controlled by local model uncertainty, while keeping high-probability safety guarantees.
Contribution
It extends the SPIBB algorithm with a softer, uncertainty-based policy constraint, enabling broader policy exploration while maintaining safety guarantees.
Findings
Significant performance improvements over existing SPI algorithms.
Effective in both finite and infinite MDPs with neural network approximation.
Provides provable safety guarantees with a less conservative approach.
Abstract
Batch Reinforcement Learning (Batch RL) consists in training a policy using trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides guarantees with high probability that the trained policy performs better than the behavioural policy, also called baseline in this setting. Previous work shows that the SPI objective improves mean performance as compared to using the basic RL objective, which boils down to solving the MDP with maximum likelihood. Here, we build on that work and improve more precisely the SPI with Baseline Bootstrapping algorithm (SPIBB) by allowing the policy search over a wider set of policies. Instead of binarily classifying the state-action pairs into two sets (the "uncertain" and the "safe-to-train-on" ones), we adopt a softer strategy that controls the error in the value estimates by constraining…
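The contrast between the binary classification and a softer, uncertainty-weighted strategy can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the count threshold `n_min`, the error proxy `1 / sqrt(N)`, the weighted-deviation cost, and the function names `uncertain_pairs` and `soft_constraint_cost` are all hypothetical. The paper's actual constraint is elided in the text above.

```python
import math

def uncertain_pairs(counts, n_min):
    """Binary split (illustrative): a state-action pair is 'uncertain'
    when it appears fewer than n_min times in the batch."""
    return {sa for sa, n in counts.items() if n < n_min}

def soft_constraint_cost(counts, pi, pi_b, state, actions):
    """Soft alternative (assumed form): weight each deviation from the
    baseline by an error proxy 1/sqrt(N) that shrinks as counts grow."""
    total = 0.0
    for a in actions:
        n = counts.get((state, a), 0)
        err = 1.0 / math.sqrt(n) if n > 0 else math.inf
        diff = abs(pi[(state, a)] - pi_b[(state, a)])
        # A pair with no deviation contributes nothing (avoids inf * 0).
        if diff > 0:
            total += err * diff
    return total

# Hypothetical batch statistics for a single state "s" with two actions.
counts = {("s", "a0"): 100, ("s", "a1"): 4}
pi_b = {("s", "a0"): 0.5, ("s", "a1"): 0.5}   # baseline policy
pi = {("s", "a0"): 0.7, ("s", "a1"): 0.3}     # candidate policy

uncertain = uncertain_pairs(counts, n_min=10)
cost = soft_constraint_cost(counts, pi, pi_b, "s", ["a0", "a1"])
```

Under the binary rule, the rarely seen pair `("s", "a1")` is simply flagged as uncertain. Under the assumed soft rule, the same pair only makes deviations more expensive, so a candidate policy could be accepted whenever `cost` stays below a chosen budget.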
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Reinforcement Learning in Robotics
