TL;DR
This paper establishes the first fully single-policy sample complexity bound for offline reinforcement learning in average-reward MDPs. The guarantee depends only on the target policy's bias span and a new policy hitting radius, not on complexity measures that are uniform over all policies, such as the uniform mixing time.
Contribution
The paper gives the first fully single-policy sample complexity bound for average-reward offline RL and is the first to handle general weakly communicating MDPs, using a new pessimistic value iteration algorithm and accompanying analysis.
Findings
Developed sharp guarantees depending only on the target policy's bias span and a novel policy hitting radius.
Proposed a novel pessimistic value iteration algorithm with quantile clipping.
Established lower bounds nearly matching the main results.
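To make the idea of pessimism concrete, here is a minimal sketch of generic pessimistic discounted value iteration on a tabular offline dataset. It is not the paper's algorithm: the count-based Hoeffding-style penalty, the function name `pessimistic_value_iteration`, the parameters `gamma` and `delta`, and the toy dataset are all assumptions made for illustration, and the quantile clipping step mentioned above is omitted because its details are not given here.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, gamma=0.9, delta=0.1, iters=500):
    """Generic pessimistic value iteration (illustrative, hypothetical helper).

    counts[s, a, s2] is the number of observed transitions s -a-> s2,
    rewards[s, a] lies in [0, 1].
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                          # visits per (s, a)
    p_hat = counts / np.maximum(n, 1)[:, :, None]   # empirical transition model
    v_max = 1.0 / (1.0 - gamma)                     # largest possible discounted value

    # Assumed Hoeffding-style penalty: large where data is scarce.
    penalty = v_max * np.sqrt(np.log(2 * S * A / delta) / (2 * np.maximum(n, 1)))
    penalty[n == 0] = v_max                         # unseen pairs get the maximal penalty

    v = np.zeros(S)
    for _ in range(iters):
        # Bellman backup on the empirical model, lowered by the penalty.
        q = np.clip(rewards + gamma * (p_hat @ v) - penalty, 0.0, v_max)
        v = q.max(axis=1)
    q = np.clip(rewards + gamma * (p_hat @ v) - penalty, 0.0, v_max)
    return v, q.argmax(axis=1)

# Toy dataset (made up): 2 states, 2 actions.
counts = np.zeros((2, 2, 2))
counts[0, 0, 0] = 1000   # well-covered: stay in state 0
counts[0, 1, 1] = 1      # barely covered: jump to state 1
counts[1, 0, 1] = 1000   # well-covered: stay in state 1
rewards = np.array([[0.5, 1.0],
                    [1.0, 1.0]])

v, policy = pessimistic_value_iteration(counts, rewards)
print(policy)
```

In this toy dataset the higher-reward action in state 0 was observed only once, so its penalty dominates and the returned policy falls back to the well-covered action. That is the sense in which pessimism lets a guarantee depend on how well the data covers a single target policy rather than all policies.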
Abstract
We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively underexamined from a theoretical perspective. While previous work obtains performance guarantees under single-policy data coverage assumptions, such guarantees utilize additional complexity measures which are uniform over all policies, such as the uniform mixing time. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL. We are also the first to handle general weakly communicating MDPs, contrasting restrictive structural assumptions made in prior work. To achieve this, we introduce an algorithm based on pessimistic discounted value…