Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

Matthew Zurek; Guy Zamir; Yudong Chen

arXiv:2506.20904·cs.LG·April 23, 2026

TL;DR

This paper establishes the first fully single-policy sample complexity bound for offline reinforcement learning in average-reward MDPs. The guarantee depends only on the target policy's bias span and a novel policy hitting radius, not on complexity measures that are uniform over all policies, such as the uniform mixing time.

Contribution

The paper gives the first fully single-policy sample complexity bound for average-reward offline RL and is the first to handle general weakly communicating MDPs, using a new pessimistic value iteration algorithm and accompanying analysis.

Findings

01  Developed sharp guarantees depending only on the target policy's bias span and hitting radius.

02  Proposed a novel pessimistic value iteration algorithm with quantile clipping.

03  Established lower bounds nearly matching the main results.
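
For readers unfamiliar with the bias span named in the first finding, the standard textbook definitions are sketched below. The notation ($\rho^\pi$, $h^\pi$, $r_\pi$, $P_\pi$) is an assumption made here for illustration and may differ from the paper's. The policy hitting radius is the paper's own quantity and is not reproduced here.

```latex
% Gain (long-run average reward) of a stationary policy \pi started from state s:
\rho^\pi(s) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}^{\pi}_{s}\!\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right]

% The bias h^\pi satisfies the Poisson (policy evaluation) equation,
% where r_\pi and P_\pi are the reward vector and transition matrix under \pi:
\rho^\pi + h^\pi \;=\; r_\pi + P_\pi h^\pi

% Bias span: the spread between the largest and smallest bias values.
\|h^\pi\|_{\mathrm{span}} \;=\; \max_{s} h^\pi(s) \;-\; \min_{s} h^\pi(s)
```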

Abstract

We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively underexamined from a theoretical perspective. While previous work obtains performance guarantees under single-policy data coverage assumptions, such guarantees utilize additional complexity measures which are uniform over all policies, such as the uniform mixing time. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL. We are also the first to handle general weakly communicating MDPs, contrasting restrictive structural assumptions made in prior work. To achieve this, we introduce an algorithm based on pessimistic discounted value…
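
To make the idea of pessimism concrete, here is a minimal sketch of generic pessimistic discounted value iteration on tabular offline data. It is not the paper's algorithm: the quantile clipping step and the paper's choice of discount factor are not reproduced, and the Hoeffding-style count-based penalty, the function name `pessimistic_discounted_vi`, and all parameters are assumptions chosen for illustration.

```python
import numpy as np

def pessimistic_discounted_vi(counts, rewards, gamma=0.9, delta=0.05, iters=2000):
    """Generic pessimistic value iteration (illustrative, not the paper's method).

    counts[s, a, s2] : number of observed transitions (s, a) -> s2 in the dataset
    rewards[s, a]    : known rewards in [0, 1] (an assumption of this sketch)
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                       # visit count per (s, a)
    safe_n = np.maximum(n, 1)                    # avoid division by zero
    # Empirical transition kernel; unvisited pairs get a uniform placeholder,
    # which is harmless because their penalty below is maximal.
    P_hat = np.where(n[:, :, None] > 0, counts / safe_n[:, :, None], 1.0 / S)

    v_max = 1.0 / (1.0 - gamma)                  # largest possible discounted value
    # Hoeffding-style penalty (assumed form): shrinks as 1/sqrt(n).
    penalty = v_max * np.sqrt(np.log(2 * S * A / delta) / (2 * safe_n))
    penalty[n == 0] = v_max                      # no data: fully pessimistic

    V = np.zeros(S)
    for _ in range(iters):
        # Penalized Bellman backup, kept inside the valid value range.
        Q = np.clip(rewards + gamma * (P_hat @ V) - penalty, 0.0, v_max)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```

The penalty makes poorly covered state-action pairs look unattractive, so the returned policy stays close to what the data supports. This is the usual motivation for requiring coverage of only a single target policy rather than of all policies.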

Videos

Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL· slideslive