Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings
Ming Yin, Yu-Xiang Wang

TL;DR
This paper establishes optimal statistical bounds for uniform offline policy evaluation (OPE) and extends model-based offline reinforcement learning to task-agnostic and reward-free settings with near-optimal sample complexity.
Contribution
It introduces a unified framework for optimal uniform OPE, built on sharp analysis tools such as the singleton absorbing MDP, and extends that framework to new offline RL settings (task-agnostic and reward-free).
Findings
Established a lower bound of $\frac{H^2 S}{d_m \epsilon^2}$ for global uniform OPE.
Achieved an upper bound of $\tilde{O}(\frac{H^2}{d_m \epsilon^2})$ for local uniform convergence.
Extended the framework to offline task-agnostic and reward-free RL with optimal sample complexity.
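To make the gap between the two rates concrete, the following sketch evaluates both expressions for given problem sizes. It is only an illustration: constants and logarithmic factors are dropped, and the function names and example values are hypothetical, not taken from the paper.

```python
def global_rate(H, S, d_m, eps):
    """H^2 * S / (d_m * eps^2): the global uniform OPE rate, constants dropped."""
    return H ** 2 * S / (d_m * eps ** 2)


def local_rate(H, d_m, eps):
    """H^2 / (d_m * eps^2): the local uniform convergence rate, logs dropped."""
    return H ** 2 / (d_m * eps ** 2)


if __name__ == "__main__":
    # Hypothetical values: horizon 10, 20 states, minimal marginal
    # state-action probability 0.01, target accuracy 0.5.
    H, S, d_m, eps = 10, 20, 0.01, 0.5
    print("global:", global_rate(H, S, d_m, eps))
    print("local: ", local_rate(H, d_m, eps))
    # The ratio of the two rates is exactly the number of states S.
    print("ratio: ", global_rate(H, S, d_m, eps) / local_rate(H, d_m, eps))
```

The two expressions differ by a factor of $S$, which is the saving obtained by restricting attention to the local policy class.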
Abstract
This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE is a stronger measure than point-wise OPE and ensures offline learning when the policy class $\Pi$ contains all policies (the global class). In this paper, we establish a $\frac{H^2 S}{d_m \epsilon^2}$ lower bound (over the model-based family) for global uniform OPE, and our main result establishes an upper bound of $\tilde{O}(\frac{H^2}{d_m \epsilon^2})$ for local uniform convergence, which applies to all near-empirically optimal policies for MDPs with stationary transitions. Here $d_m$ is the minimal marginal state-action probability. Critically, the highlight in achieving the…
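For readers unfamiliar with model-based OPE, the sketch below shows a generic tabular plug-in estimator: estimate a stationary transition kernel and mean rewards from logged episodes, then evaluate a policy by backward induction in the estimated MDP. This is a minimal sketch under stated assumptions, not necessarily the authors' exact procedure. The data layout, the function names, and the choice to zero out unvisited state-action pairs are all assumptions made for illustration.

```python
def estimate_model(episodes, n_states, n_actions):
    """Return (P_hat, r_hat) from logged episodes.

    episodes: list of trajectories, each a list of (s, a, r, s_next) tuples.
    The transition is assumed stationary, so counts are pooled over time steps.
    """
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[s][a][s_next] += 1
            visits[s][a] += 1
            reward_sum[s][a] += r
    # Assumption: unvisited (s, a) pairs get zero transition mass and zero reward.
    P_hat = [[[counts[s][a][t] / visits[s][a] if visits[s][a] else 0.0
               for t in range(n_states)]
              for a in range(n_actions)]
             for s in range(n_states)]
    r_hat = [[reward_sum[s][a] / visits[s][a] if visits[s][a] else 0.0
              for a in range(n_actions)]
             for s in range(n_states)]
    return P_hat, r_hat


def evaluate_policy(P_hat, r_hat, policy, H):
    """Estimated value at the first step for each state, by backward induction.

    policy[h][s] is the (deterministic) action taken in state s at step h.
    """
    n_states = len(P_hat)
    V = [0.0] * n_states  # value after the final step is zero
    for h in reversed(range(H)):
        V = [r_hat[s][policy[h][s]]
             + sum(P_hat[s][policy[h][s]][t] * V[t] for t in range(n_states))
             for s in range(n_states)]
    return V


if __name__ == "__main__":
    # Toy log: one episode of length 2 in a 2-state, 1-action MDP.
    episodes = [[(0, 0, 1.0, 1), (1, 0, 0.0, 1)]]
    P_hat, r_hat = estimate_model(episodes, n_states=2, n_actions=1)
    policy = [[0, 0], [0, 0]]
    print(evaluate_policy(P_hat, r_hat, policy, H=2))
```

Uniform OPE asks that the error of such an estimate be small simultaneously for every policy in a class, rather than for one fixed `policy`.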
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems
