Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning

Ming Yin; Yu Bai; Yu-Xiang Wang

arXiv:2007.03760 · cs.LG · December 2, 2020 · 24 cites

TL;DR

This paper establishes nearly optimal uniform convergence bounds for offline policy evaluation in reinforcement learning, enabling reliable evaluation of all policies in a class and advancing theoretical understanding of offline RL.

Contribution

The paper gives the first systematic analysis of uniform convergence in OPE, obtains nearly optimal error bounds for several global and local policy classes, and shows that model-based planning achieves an optimal episode complexity.

Findings

01. Achieves nearly optimal error bounds for uniform convergence in OPE.

02. Demonstrates optimal episode complexity for identifying epsilon-optimal policies.

03. First to systematically investigate uniform convergence in offline policy evaluation.

Abstract

The problem of Offline Policy Evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a fixed target policy $π$, which does not provide useful bounds for offline policy learning, as $π$ will then be data-dependent. We address this problem by simultaneously evaluating all policies in a policy class $Π$ -- uniform convergence in OPE -- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of $O(H^{3} / (d_{m} ϵ^{2}))$ in identifying an $ϵ$-optimal policy under the time-inhomogeneous episodic MDP model ($H$ is the planning horizon, $d_{m}$ is a quantity that reflects the exploration of the logging policy $μ$). To the best of our knowledge, this is…
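To make the setting in the abstract concrete, here is a minimal sketch of a model-based (plug-in) OPE estimator for a tabular, time-inhomogeneous episodic MDP. It is an illustration under stated assumptions, not the authors' code: the function names `estimate_model` and `evaluate_policy`, the episode format, and the choice to treat unvisited state-action pairs as having zero reward and no successor are all hypothetical. The point is that a single estimated model is built from the logged data and then reused to evaluate every policy in a class, which is the situation that uniform convergence bounds are meant to cover.

```python
import numpy as np

def estimate_model(episodes, S, A, H):
    """Empirical per-step transitions and rewards from logged episodes.

    Assumed format: each episode is a list of H tuples (s, a, r, s_next)
    with integer states in [0, S) and actions in [0, A).
    """
    counts = np.zeros((H, S, A, S))
    reward_sum = np.zeros((H, S, A))
    for ep in episodes:
        for t, (s, a, r, s_next) in enumerate(ep):
            counts[t, s, a, s_next] += 1
            reward_sum[t, s, a] += r
    visits = counts.sum(axis=3)
    # Assumption: unvisited (t, s, a) keep all-zero transitions and rewards.
    safe_visits = np.maximum(visits, 1)
    P_hat = counts / safe_visits[..., None]
    r_hat = reward_sum / safe_visits
    return P_hat, r_hat

def evaluate_policy(P_hat, r_hat, policy, init_dist):
    """Plug-in value of `policy` under the estimated model.

    policy[t, s, a] is the probability of action a in state s at step t.
    Uses backward induction over the horizon.
    """
    H, S, A = r_hat.shape
    V = np.zeros(S)
    for t in reversed(range(H)):
        Q = r_hat[t] + P_hat[t] @ V          # shape (S, A)
        V = (policy[t] * Q).sum(axis=1)      # shape (S,)
    return float(init_dist @ V)
```

A usage pattern would be to call `estimate_model` once on the logged episodes and then call `evaluate_policy` for each candidate policy in the class, picking the one with the largest estimated value. The estimates are only as good as the logging policy's coverage of the state-action space, which is the role the quantity $d_{m}$ plays in the abstract.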

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Advanced Bandit Algorithms Research