Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

Ming Yin; Yu-Xiang Wang

arXiv:2105.06029 · cs.LG · June 25, 2021

TL;DR

This paper establishes optimal statistical bounds for uniform offline policy evaluation and extends model-based offline reinforcement learning to task-agnostic and reward-free settings, achieving near-optimal complexity.

Contribution

It introduces a unified framework, with sharp analysis tools such as singleton absorbing MDPs, for optimal uniform OPE and extends it to new offline RL settings with optimal sample complexity.

Findings

01

Established a lower bound of $\Omega\left(\frac{H^2 S}{d_m \epsilon^2}\right)$ for global uniform OPE.

02

Achieved an upper bound of $\tilde{O}\left(\frac{H^2}{d_m \epsilon^2}\right)$ for local uniform convergence.

03

Extended framework to offline task-agnostic and reward-free RL with optimal complexity.
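
As a rough side-by-side reading of findings 01 and 02, the two rates can be divided term by term. This ignores constants and the logarithmic factors hidden in $\tilde{O}$, and it compares a lower bound for the global policy class with an upper bound for the local (near-empirically optimal) class, so it is an illustration of scale rather than a gap for a single problem:

$$
\frac{H^{2} S / (d_{m} \epsilon^{2})}{H^{2} / (d_{m} \epsilon^{2})} = S
$$

Under this reading, restricting attention to the local class removes a factor of $S$ (the number of states) from the rate.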

Abstract

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE, $\sup_{\Pi} | Q^{\pi} - \hat{Q}^{\pi} | < \epsilon$, is a stronger measure than point-wise OPE and ensures offline learning when $\Pi$ contains all policies (the global class). In this paper, we establish an $\Omega(H^{2} S / d_{m} \epsilon^{2})$ lower bound (over the model-based family) for the global uniform OPE, and our main result establishes an upper bound of $\tilde{O}(H^{2} / d_{m} \epsilon^{2})$ for the local uniform convergence that applies to all near-empirically optimal policies for MDPs with stationary transitions. Here $d_{m}$ is the minimal marginal state-action probability. Critically, the highlight in achieving the…
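
To make the model-based setup in the abstract concrete, here is a minimal sketch of plug-in OPE for a tabular episodic MDP with stationary transitions. It is an illustration under stated assumptions, not the authors' implementation: the function names (`estimate_model`, `evaluate_policy`, `sup_error`), the data layout (episodes as lists of `(s, a, s_next)` tuples), and the known, time-homogeneous reward table `r` are all hypothetical choices made for this example.

```python
def estimate_model(episodes, S, A):
    """Empirical transition model P_hat[s][a][s'].

    Counts are pooled over all time steps, which is only valid under the
    stationary-transition assumption mentioned in the abstract.
    """
    counts = [[[0] * S for _ in range(A)] for _ in range(S)]
    for episode in episodes:
        for s, a, s_next in episode:
            counts[s][a][s_next] += 1
    P_hat = [[[0.0] * S for _ in range(A)] for _ in range(S)]
    for s in range(S):
        for a in range(A):
            n_sa = sum(counts[s][a])
            if n_sa > 0:  # unvisited (s, a) pairs keep an all-zero row
                P_hat[s][a] = [c / n_sa for c in counts[s][a]]
    return P_hat


def evaluate_policy(P, r, pi, H):
    """Backward induction for Q^pi over horizon H.

    P[s][a][s'] is a transition model, r[s][a] a reward table (assumed
    known here), and pi[h][s][a] the probability of action a in state s
    at step h. Returns Q with Q[h][s][a].
    """
    S, A = len(r), len(r[0])
    Q = [None] * H
    V_next = [0.0] * S  # value after the final step is zero
    for h in reversed(range(H)):
        Q[h] = [[r[s][a] + sum(P[s][a][t] * V_next[t] for t in range(S))
                 for a in range(A)] for s in range(S)]
        V_next = [sum(pi[h][s][a] * Q[h][s][a] for a in range(A))
                  for s in range(S)]
    return Q


def sup_error(Q_a, Q_b):
    """Largest entrywise gap |Q_a - Q_b| over all (h, s, a)."""
    return max(abs(x - y)
               for Qa_h, Qb_h in zip(Q_a, Q_b)
               for row_a, row_b in zip(Qa_h, Qb_h)
               for x, y in zip(row_a, row_b))
```

With a true model `P` available (for example in a simulation), `sup_error(evaluate_policy(P, r, pi, H), evaluate_policy(P_hat, r, pi, H))` gives the OPE error for one policy `pi`; uniform OPE asks for this quantity to be small simultaneously for every policy in the class $\Pi$.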

Videos

Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems