Safe Policy Improvement with Baseline Bootstrapping
Romain Laroche, Paul Trichelair, Rémi Tachet des Combes

TL;DR
This paper introduces SPIBB, a safe policy improvement method for batch reinforcement learning whose trained policy is guaranteed to perform at least as well as the baseline policy. It falls back on the baseline wherever uncertainty is high, and it is evaluated on several domains, including a deep RL setting.
Contribution
The paper proposes SPIBB, a novel safe policy improvement algorithm with theoretical guarantees, practical variants, and a model-free deep RL implementation that outperforms existing methods in safety and performance.
Findings
SPIBB guarantees performance at least as good as the baseline policy in batch RL.
SPIBB outperforms existing algorithms in safety and mean performance.
Deep RL version SPIBB-DQN trains efficiently without environment interaction.
Abstract
This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement Learning (Batch RL): from a fixed dataset and without direct access to the true environment, train a policy that is guaranteed to perform at least as well as the baseline policy used to collect the data. Our approach, called SPI with Baseline Bootstrapping (SPIBB), is inspired by the knows-what-it-knows paradigm: it bootstraps the trained policy with the baseline when the uncertainty is high. Our first algorithm, $\Pi_b$-SPIBB, comes with SPI theoretical guarantees. We also implement a variant, $\Pi_{\leq b}$-SPIBB, that is even more efficient in practice. We apply our algorithms to a motivational stochastic gridworld domain and further demonstrate on randomly generated MDPs the superiority of SPIBB with respect to existing algorithms, not only in safety but also in mean performance. Finally, we implement a…
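The bootstrapping idea in the abstract can be illustrated with a small tabular sketch. This is not the authors' code: it assumes that uncertainty is measured by state-action counts in the dataset, and the function name `spibb_greedy_step` and the threshold parameter `n_min` are hypothetical names chosen for illustration. Wherever a state-action pair was observed fewer than `n_min` times, the new policy copies the baseline probability; the remaining probability mass goes to the best sufficiently observed action according to the current Q-value estimates.

```python
import numpy as np

def spibb_greedy_step(q, pi_b, counts, n_min):
    """One baseline-bootstrapped greedy improvement step (illustrative sketch).

    q, pi_b, counts: arrays of shape (n_states, n_actions) holding Q-value
    estimates, baseline action probabilities, and dataset counts.
    n_min: hypothetical count threshold below which a pair is "uncertain".
    """
    q = np.asarray(q, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    uncertain = np.asarray(counts) < n_min

    # Uncertain pairs keep exactly the baseline probability.
    pi = np.where(uncertain, pi_b, 0.0)

    # Probability mass left to distribute in each state.
    free_mass = 1.0 - pi.sum(axis=1)

    # Best action among the well-observed ones (uncertain ones are masked out).
    masked_q = np.where(uncertain, -np.inf, q)
    best = masked_q.argmax(axis=1)

    # States where every action is uncertain simply keep the baseline policy.
    rows = np.flatnonzero((~uncertain).any(axis=1))
    pi[rows, best[rows]] += free_mass[rows]
    return pi
```

In a state where every action is under-observed, the result is the baseline policy itself, which is the fallback behavior the abstract describes. A full algorithm would alternate this step with policy evaluation on a model estimated from the dataset; that loop is omitted here.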
Taxonomy
Topics: Information and Cyber Security
