Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP
Jinghan Wang, Mengdi Wang, Lin F. Yang

TL;DR
This paper establishes near sample-optimal bounds for policy learning in average reward MDPs via a reduction to discounted MDPs, improving on previous mixing-time-based results and nearly matching a lower bound.
Contribution
It introduces a reduction from average reward MDPs to discounted MDPs, yielding nearly optimal sample complexity bounds that improve on prior results relying on mixing-time assumptions.
Findings
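Upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln(1/\delta))$ samples per state-action pair.
Lower bound of $\Omega(|\mathcal{S}||\mathcal{A}| H \varepsilon^{-1} \ln(1/\delta))$ total samples, matching the upper bound in its dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, and $H$ up to logarithmic factors.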
Reduction technique from AMDP to discounted MDPs enables application of DMDP algorithms.
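The reduction idea in the last finding can be illustrated with a minimal sketch: solve a discounted MDP whose discount factor is close to 1, then use the resulting policy in the average reward problem. Everything below is an assumption made for illustration: the two-state MDP, the names `solve_dmdp` and `average_reward`, and the discount values 0.5 and 0.99. It uses exact value iteration on a known model rather than the paper's sample-based setting, and it does not reproduce the paper's choice of discount factor in terms of $H$ and $\varepsilon$.

```python
# Hypothetical toy MDP: P[s][a] lists next-state probabilities, R[s][a] is the reward.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays put, action 1 moves to state 1
    [[0.1, 0.9], [1.0, 0.0]],  # state 1: action 0 mostly stays, action 1 moves to state 0
]
R = [[0.5, 0.0], [1.0, 0.0]]


def solve_dmdp(P, R, gamma, tol=1e-10):
    """Value iteration on the discounted MDP; returns a greedy deterministic policy."""
    V = [0.0] * len(P)
    while True:
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(len(P[s]))] for s in range(len(P))]
        V_new = [max(row) for row in Q]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return [row.index(max(row)) for row in Q]
        V = V_new


def average_reward(P, R, policy, start, horizon=20000):
    """Long-run average reward of a policy, estimated by propagating the state distribution."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    total = 0.0
    for _ in range(horizon):
        total += sum(d * R[s][policy[s]] for s, d in enumerate(dist))
        nxt = [0.0] * len(P)
        for s, d in enumerate(dist):
            for s2, p in enumerate(P[s][policy[s]]):
                nxt[s2] += d * p
        dist = nxt
    return total / horizon


myopic = solve_dmdp(P, R, gamma=0.5)      # small discount factor
patient = solve_dmdp(P, R, gamma=0.99)    # discount factor close to 1
gain_myopic = average_reward(P, R, myopic, start=0)
gain_patient = average_reward(P, R, patient, start=0)
```

On this toy instance the policy from `gamma=0.5` stays in state 0 and earns an average reward of 0.5, while the policy from `gamma=0.99` moves to state 1 and earns about 10/11. This is why the discount factor in such a reduction has to approach 1 as the target accuracy tightens.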
Abstract
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP), given access to a generative model (simulator). When the ground-truth MDP is weakly communicating, we prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := sp(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability. This bound improves the best-known mixing-time-based approaches in [Jin & Sidford 2021], which assume the mixing-time of every deterministic policy is bounded. The core of our analysis is a proper reduction bound from AMDP problems to discounted MDP (DMDP) problems, which may be of independent interest since it allows the application of DMDP algorithms for AMDP in other settings. We complement our upper…
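To make the span-of-bias parameter concrete, here is a minimal sketch that computes the span of the bias of one fixed policy on a hypothetical two-state chain. The chain `P_pi`, the rewards `r_pi`, and the helper `bias_span` are illustrative assumptions, not taken from the paper. The sketch relies on a standard fact: for an aperiodic unichain policy, the T-step value differs from the bias by a state-independent term plus an error that vanishes as T grows, so the span of the T-step values converges to the span of the bias.

```python
def bias_span(P_pi, r_pi, horizon=200):
    """Span (max minus min) of the bias of a fixed policy.

    Assumes the policy induces an aperiodic unichain Markov chain, so the span
    of the undiscounted finite-horizon values converges to the span of the bias.
    """
    v = [0.0] * len(r_pi)
    for _ in range(horizon):
        # One step of undiscounted policy evaluation: v <- r_pi + P_pi v
        v = [r_pi[s] + sum(p * x for p, x in zip(P_pi[s], v))
             for s in range(len(r_pi))]
    return max(v) - min(v)


# Hypothetical chain: state 0 always moves to state 1 (reward 0);
# state 1 stays with probability 0.9 (reward 1) and returns to state 0 otherwise.
P_pi = [[0.0, 1.0], [0.1, 0.9]]
r_pi = [0.0, 1.0]
span = bias_span(P_pi, r_pi)
```

For this chain the gain is 10/11 and the bias equations give a bias difference of 10/11 between the two states, which is the value `bias_span` converges to. In the paper the relevant quantity is the span for an optimal policy; this sketch only evaluates a single given policy.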
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms
