Nearly Horizon-Free Offline Reinforcement Learning
Tongzheng Ren, Jialian Li, Bo Dai, Simon S. Du, Sujay Sanghavi

TL;DR
This paper establishes nearly horizon-free sample complexity bounds for offline reinforcement learning in episodic time-homogeneous MDPs with bounded total reward, covering both offline policy evaluation and offline policy optimization.
Contribution
It provides the first nearly horizon-free bounds for offline RL in episodic time-homogeneous tabular MDPs and linear MDPs with anchor points, obtained through a novel recursion-based analysis.
Findings
The error bound for offline policy evaluation with the plug-in estimator matches the lower bound up to logarithmic factors
The sub-optimality gap for offline policy optimization approaches the lower bound
Introduces a recursion-based method for variance bounding in offline RL
Abstract
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDP). For tabular MDP with $S$ states and $A$ actions, or linear MDP with anchor points and feature dimension $d$, given the collected $K$ episodes data with minimum visiting probability of (anchor) state-action pairs $d_m$, we obtain nearly horizon $H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by $1$. Specifically: 1. For offline policy evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{K d_m}}\right)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and does not have additional dependency on $\mathrm{poly}(H, S, A, d)$ in higher-order term. 2. For offline policy optimization, we obtain an …
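As a rough illustration of what a plug-in estimator for offline policy evaluation looks like in the tabular, time-homogeneous setting, the sketch below fits an empirical transition and reward model from logged episodes and then evaluates a policy on that model by backward induction over the horizon. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`fit_empirical_model`, `plug_in_value`), the data layout (each episode is a list of `(state, action, reward, next_state)` tuples), the restriction to a deterministic stationary policy, and the choice to leave unvisited state-action pairs at zero are all illustrative.

```python
def fit_empirical_model(episodes, n_states, n_actions):
    """Estimate one shared (time-homogeneous) model from logged transitions."""
    counts = [[0] * n_actions for _ in range(n_states)]
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    next_counts = [[[0] * n_states for _ in range(n_actions)]
                   for _ in range(n_states)]

    # Pool transitions from every step of every episode: the model does not
    # depend on the step index, so all visits to (s, a) share one estimate.
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[s][a] += 1
            reward_sum[s][a] += r
            next_counts[s][a][s_next] += 1

    p_hat = [[[0.0] * n_states for _ in range(n_actions)]
             for _ in range(n_states)]
    r_hat = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            n = counts[s][a]
            if n > 0:
                r_hat[s][a] = reward_sum[s][a] / n
                p_hat[s][a] = [c / n for c in next_counts[s][a]]
            # Assumption of this sketch: unvisited pairs keep all-zero
            # estimates, i.e. they contribute no reward and no next state.
    return p_hat, r_hat


def plug_in_value(p_hat, r_hat, policy, horizon, initial_dist):
    """Evaluate a deterministic stationary policy on the empirical model."""
    n_states = len(p_hat)
    v = [0.0] * n_states  # value with zero steps remaining
    for _ in range(horizon):
        # One Bellman backup under the empirical model.
        v = [
            r_hat[s][policy[s]]
            + sum(p * v_next for p, v_next in zip(p_hat[s][policy[s]], v))
            for s in range(n_states)
        ]
    return sum(mu * v_s for mu, v_s in zip(initial_dist, v))


# Toy data: 2 states, 2 actions, 2 episodes of length 2, total reward <= 1.
episodes = [
    [(0, 0, 0.0, 1), (1, 0, 0.5, 1)],
    [(0, 0, 0.0, 0), (0, 0, 0.0, 1)],
]
p_hat, r_hat = fit_empirical_model(episodes, n_states=2, n_actions=2)
value = plug_in_value(p_hat, r_hat, policy=[0, 0], horizon=2,
                      initial_dist=[1.0, 0.0])
```

In this toy data, state 0 under action 0 is visited three times and state 1 under action 0 once. The least-visited state-action pair limits how accurate the plug-in estimate can be, which is the role the minimum visiting probability plays in the bounds described above.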
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Machine Learning and Algorithms
