Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning
Ming Yin, Yu Bai, Yu-Xiang Wang

TL;DR
This paper establishes nearly optimal uniform convergence bounds for offline policy evaluation in reinforcement learning, enabling reliable evaluation of all policies in a class and advancing theoretical understanding of offline RL.
Contribution
It introduces the first systematic analysis of uniform convergence in OPE, achieving nearly optimal error bounds for policy classes and demonstrating optimal episode complexity for model-based planning.
Findings
Achieves nearly optimal error bounds for uniform convergence in OPE.
Demonstrates optimal episode complexity for identifying epsilon-optimal policies.
First to systematically investigate uniform convergence in offline policy evaluation.
Abstract
The problem of Offline Policy Evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a fixed target policy $\pi$, which does not provide useful bounds for offline policy learning, as $\pi$ will then be data-dependent. We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE -- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/(d_m\epsilon^2))$ in identifying an $\epsilon$-optimal policy under the time-inhomogeneous episodic MDP model ($H$ is the planning horizon, $d_m$ is a quantity that reflects the exploration of the logging policy $\mu$). To the best of our knowledge, this is…
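To make "simultaneously evaluating all policies in a policy class" concrete, here is a minimal sketch of model-based evaluation in a tabular, time-inhomogeneous episodic MDP. It is an illustration under stated assumptions, not the paper's implementation: the function names (`fit_model`, `evaluate`, `plan_over_class`), the episode format (a list of `(state, action, reward, next_state)` tuples, one per step), the restriction to deterministic policies, and the choice to assign value 0 to state-action pairs the logging policy never visited are all hypothetical. The point it shows is that a single empirical model, fitted once from logged data, is reused to score every policy in the class.

```python
from collections import defaultdict


def fit_model(episodes, horizon):
    """Count-based estimates of step-wise transitions, mean rewards, and the
    initial-state distribution from logged episodes (assumed format: each
    episode is a list of (s, a, r, s_next) tuples of length `horizon`)."""
    trans = [defaultdict(lambda: defaultdict(int)) for _ in range(horizon)]
    rew_sum = [defaultdict(float) for _ in range(horizon)]
    visits = [defaultdict(int) for _ in range(horizon)]
    init = defaultdict(int)
    for ep in episodes:
        init[ep[0][0]] += 1
        for h, (s, a, r, s_next) in enumerate(ep):
            trans[h][(s, a)][s_next] += 1
            rew_sum[h][(s, a)] += r
            visits[h][(s, a)] += 1
    return trans, rew_sum, visits, init, len(episodes)


def evaluate(model, policy, horizon):
    """Estimated value of a deterministic policy `policy(h, s) -> a`,
    computed by backward induction in the empirical model."""
    trans, rew_sum, visits, init, n = model
    value_next = defaultdict(float)  # value after the final step is 0
    for h in reversed(range(horizon)):
        value = defaultdict(float)
        for s in {s for (s, _a) in visits[h]}:
            a = policy(h, s)
            n_sa = visits[h].get((s, a), 0)
            if n_sa == 0:
                continue  # assumption: unvisited pairs contribute value 0
            r_hat = rew_sum[h][(s, a)] / n_sa
            exp_next = sum(c / n_sa * value_next[s2]
                           for s2, c in trans[h][(s, a)].items())
            value[s] = r_hat + exp_next
        value_next = value
    return sum(c / n * value_next[s] for s, c in init.items())


def plan_over_class(episodes, policy_class, horizon):
    """Score every policy in the class with the same fitted model and
    return (index of the best policy, list of all estimated values)."""
    model = fit_model(episodes, horizon)
    values = [evaluate(model, pi, horizon) for pi in policy_class]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values
```

Because the selected policy depends on the data, a guarantee for one fixed policy does not cover it; a bound that holds uniformly over the class does, which is the motivation the abstract gives for uniform convergence.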
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Advanced Bandit Algorithms Research
