Finding good policies in average-reward Markov Decision Processes without prior knowledge
Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann

TL;DR
This paper introduces a new algorithm for identifying near-optimal policies in average-reward MDPs without prior knowledge, achieving near-optimal sample complexity in the PAC setting and establishing fundamental limits in the online setting.
Contribution
It presents the first H-agnostic PAC algorithm for average-reward MDPs and provides lower bounds and improved online algorithms with data-dependent stopping rules.
Findings
The PAC algorithm has sample complexity scaling as SAD/ε^2, near the theoretical lower bound.
A lower bound shows that, in the online setting, no algorithm can achieve a sample complexity polynomial in H.
An online algorithm with sample complexity SAD^2/ε^2 is proposed, with potential for further improvement.
Abstract
We revisit the identification of an ε-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, D, and the optimal bias span, H, which satisfy H ≤ D. Prior work has studied the complexity of ε-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with D ≃ H for which the sample complexity to output an ε-optimal policy is Ω(SAD/ε^2), where S and A are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order SAH/ε^2 has been proposed, but it requires the knowledge of H. We first show that the sample complexity required to estimate H is not bounded by any function of S, A and H, ruling out the…
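To make the two complexity measures in the abstract concrete, here is a minimal sketch that computes the diameter D and the optimal bias span H of a known MDP and checks that H ≤ D. The 2-state MDP, the helper names `diameter` and `optimal_bias_span`, and the use of plain value iteration are all assumptions made for illustration; none of them come from the paper, whose algorithms must estimate such quantities from samples rather than from a known model.

```python
# Toy 2-state, 2-action MDP (hypothetical, not taken from the paper).
# P[s][a][s2] is a transition probability, R[s][a] a reward in [0, 1].
P = [
    [[0.9, 0.1], [1.0, 0.0]],  # state 0: a0 drifts to state 1, a1 stays put
    [[0.1, 0.9], [1.0, 0.0]],  # state 1: a0 mostly stays, a1 jumps to state 0
]
R = [
    [0.0, 0.2],
    [1.0, 0.0],
]


def diameter(P, tol=1e-12, max_iter=100_000):
    """Largest, over ordered state pairs, of the minimal expected hitting time."""
    n = len(P)
    worst = 0.0
    for target in range(n):
        t = [0.0] * n  # t[s]: current estimate of min expected steps s -> target
        for _ in range(max_iter):
            new = [0.0] * n
            for s in range(n):
                if s == target:
                    continue  # hitting time from the target to itself is 0
                new[s] = 1.0 + min(
                    sum(p * t[s2] for s2, p in enumerate(row) if s2 != target)
                    for row in P[s]
                )
            done = max(abs(x - y) for x, y in zip(new, t)) < tol
            t = new
            if done:
                break
        worst = max(worst, max(t))
    return worst


def optimal_bias_span(P, R, tol=1e-12, max_iter=100_000):
    """Span of the optimal bias, via relative value iteration (state 0 as reference)."""
    n = len(P)
    h = [0.0] * n
    for _ in range(max_iter):
        q = [
            max(
                R[s][a] + sum(p * h[s2] for s2, p in enumerate(P[s][a]))
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
        new = [x - q[0] for x in q]  # subtract the reference value to keep h bounded
        done = max(abs(x - y) for x, y in zip(new, h)) < tol
        h = new
        if done:
            break
    return max(h) - min(h)


D = diameter(P)
H = optimal_bias_span(P, R)
print(f"D = {D:.4f}, H = {H:.4f}, H <= D: {H <= D}")
```

On this toy instance, solving the fixed-point equations by hand gives D = 10 (reaching state 1 from state 0 takes 10 steps in expectation) and H = 5, so H ≤ D holds with a factor-two gap. Relative value iteration is only guaranteed to converge under aperiodicity assumptions, which the self-loops in this example provide.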
