Safe Policy Improvement Approaches and their Limitations

Philipp Scholl; Felix Dietrich; Clemens Otte; Steffen Udluft

arXiv:2208.00724·cs.LG·August 2, 2022·1 cites

TL;DR

This paper critically examines Safe Policy Improvement algorithms in offline reinforcement learning, revealing limitations in existing methods and proposing new algorithms with provable safety guarantees, supported by extensive experiments.

Contribution

The paper identifies flaws in the Soft-SPIBB safety claims, introduces Adv-Soft-SPIBB algorithms with proven safety, and demonstrates the practical limitations of the safety bounds in real data scenarios.

Findings

1. Soft-SPIBB safety claims are invalid.
2. Adv-Soft-SPIBB algorithms are provably safe.
3. Heuristic Lower-Approx-Soft-SPIBB performs best in experiments.

Abstract

Safe Policy Improvement (SPI) is an important technique for offline reinforcement learning in safety-critical applications, as it improves the behavior policy with high probability. We classify various SPI approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with Soft Baseline Bootstrapping) algorithms, we show that their claim of being provably safe does not hold. Based on this finding, we develop adaptations, the Adv-Soft-SPIBB algorithms, and show that they are provably safe. A heuristic adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all SPIBB algorithms in extensive experiments on two benchmarks. We also check the safety guarantees of the provably safe algorithms and show that huge amounts of data are necessary such that the safety bounds become…

Code & Models

Repositories

philipp238/safe-policy-improvement-approaches-on-discrete-markov-decision-processes (Official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Software Reliability and Analysis Research · Formal Methods in Verification