Decision-Point Guided Safe Policy Improvement
Abhishek Sharma, Leo Benac, Sonali Parbhoo, Finale Doshi-Velez

TL;DR
This paper introduces DPRL, a safe policy improvement algorithm for batch reinforcement learning that focuses on decision points to ensure high-confidence improvements while effectively managing risk in sparse data regions.
Contribution
DPRL is a novel algorithm that restricts policy improvement to decision points (densely visited states). This restriction yields tighter, data-dependent bounds than prior work, and the method is shown to be safe and performant on synthetic and real datasets.
Findings
DPRL guarantees high-confidence improvements at decision points.
DPRL achieves tighter bounds that do not scale with state-action space size.
DPRL demonstrates safety and performance on synthetic and real datasets.
Abstract
Within batch reinforcement learning, safe policy improvement (SPI) seeks to ensure that the learnt policy performs at least as well as the behavior policy that generated the dataset. The core challenge in SPI is seeking improvements while balancing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (i.e. decision points) while still utilizing data from sparsely visited states. By appropriately limiting where and how we may deviate from the behavior policy, we achieve tighter bounds than prior work; specifically, our data-dependent bounds do not scale with the size of the state and action spaces. In addition to the analysis, we…
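The idea of deviating from the behavior policy only at densely visited state-action pairs can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give DPRL's exact criterion, so the count threshold `n_min`, the precomputed value estimates `q_hat`, the simple "beats the empirical behavior value" test, and all function names below are assumptions made for illustration. A real SPI method would replace that test with a high-confidence bound.

```python
from collections import Counter

def select_decision_points(batch, q_hat, n_min=20):
    """Return {state: action} overrides for states treated as decision points.

    batch : list of (state, action) pairs observed in the dataset.
    q_hat : dict mapping (state, action) -> estimated return (assumed given).
    n_min : hypothetical visit-count threshold for "densely visited".
    """
    counts = Counter(batch)                       # visits per (state, action)
    state_counts = Counter(s for s, _ in batch)   # visits per state
    overrides = {}
    for s, n_s in state_counts.items():
        actions = [a for (s2, a) in counts if s2 == s]
        # Empirical value of the behavior policy at s: count-weighted mean of q_hat.
        v_behavior = sum(counts[(s, a)] * q_hat[(s, a)] for a in actions) / n_s
        # Only actions with enough data are candidates for improvement.
        dense = [a for a in actions if counts[(s, a)] >= n_min]
        if not dense:
            continue  # sparse state: keep following the behavior policy
        best = max(dense, key=lambda a: q_hat[(s, a)])
        if q_hat[(s, best)] > v_behavior:
            overrides[s] = best
    return overrides

def act(state, overrides, behavior_action):
    """Deviate only at decision points; otherwise defer to the behavior policy."""
    return overrides.get(state, behavior_action)

# Toy data: s0 is densely visited, s1 is sparse.
batch = [("s0", "a")] * 30 + [("s0", "b")] * 25 + [("s1", "a")] * 3 + [("s1", "b")] * 2
q_hat = {("s0", "a"): 1.0, ("s0", "b"): 2.0, ("s1", "a"): 0.0, ("s1", "b"): 5.0}
overrides = select_decision_points(batch, q_hat)
```

In this toy data, action `b` in state `s1` has the highest estimate, but it is backed by only two samples, so the sketch leaves `s1` to the behavior policy and changes the action only in `s0`.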
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
