Finding good policies in average-reward Markov Decision Processes without prior knowledge

Adrienne Tuynman; Rémy Degenne; Emilie Kaufmann

arXiv:2405.17108 · cs.LG · May 28, 2024 · 1 citation

TL;DR

This paper introduces an algorithm that identifies near-optimal policies in average-reward MDPs without any prior knowledge of the MDP. Its sample complexity nearly matches the lower bound known for the generative-model setting, and the paper also establishes a fundamental limit for the online setting.

Contribution

It presents the first PAC policy-identification algorithm for average-reward MDPs that needs no prior knowledge of H, and complements it with a lower bound and algorithms for the online setting, including an approach based on a data-dependent stopping rule.

Findings

1. The PAC algorithm has a sample complexity scaling as SAD/ε^2, which nearly matches the known lower bound.

2. A lower bound shows that a sample complexity polynomial in H cannot be achieved in the online setting.

3. An online algorithm with a sample complexity scaling as SAD^2/ε^2 is proposed, with potential for further improvement.
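To get a feel for the gap between the two rates quoted in the findings, the sketch below evaluates their leading terms only. It is an illustration under strong assumptions: constants and logarithmic factors are dropped, the function names `leading_term_generative` and `leading_term_online` are hypothetical, and the returned numbers are orders of magnitude rather than sample sizes guaranteed by any algorithm in the paper.

```python
def leading_term_generative(S, A, D, eps):
    """Leading term S*A*D / eps^2 (constants and log factors omitted)."""
    return S * A * D / eps ** 2


def leading_term_online(S, A, D, eps):
    """Leading term S*A*D^2 / eps^2 (constants and log factors omitted)."""
    return S * A * D ** 2 / eps ** 2


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    S, A, D, eps = 10, 4, 20, 0.1
    gen = leading_term_generative(S, A, D, eps)
    onl = leading_term_online(S, A, D, eps)
    print(f"generative-model rate ~ {gen:.3g}")
    print(f"online rate           ~ {onl:.3g}")
    print(f"ratio (equals D)      ~ {onl / gen:.3g}")
```

Under these simplifications the two leading terms differ by exactly a factor of D, so the gap matters most for MDPs with a large diameter.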

Abstract

We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H \leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^{2})$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^{2}$ has been proposed, but it requires the knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S$, $A$ and $H$, ruling out the…
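The two complexity measures named in the abstract can be made concrete on a toy example. The sketch below is not taken from the paper: the two-state MDP, the function names `diameter` and `optimal_bias_span`, and the iteration counts are all assumptions made for illustration. It computes the diameter by value iteration on minimal expected hitting times, and the optimal bias span by relative value iteration, which is assumed to converge here because the toy MDP is communicating and aperiodic.

```python
# Hypothetical toy MDP: P[s][a] is a distribution over next states,
# R[s][a] is a reward in [0, 1].
P = [
    [[1.0, 0.0], [0.9, 0.1]],  # state 0: stay, or reach state 1 w.p. 0.1
    [[0.5, 0.5], [0.0, 1.0]],  # state 1: leave w.p. 0.5, or stay
]
R = [
    [0.0, 0.0],
    [1.0, 0.5],
]


def expected_next(dist, values):
    """Expectation of `values` under the next-state distribution `dist`."""
    return sum(p * v for p, v in zip(dist, values))


def diameter(P, iters=5000):
    """Largest, over (start, target) pairs, minimal expected hitting time."""
    n = len(P)
    worst = 0.0
    for target in range(n):
        t = [0.0] * n
        for _ in range(iters):
            # Bellman update for the minimal expected time to reach `target`.
            t = [
                0.0 if s == target
                else 1.0 + min(expected_next(dist, t) for dist in P[s])
                for s in range(n)
            ]
        worst = max(worst, max(t))
    return worst


def optimal_bias_span(P, R, iters=5000):
    """Span of the bias vector found by relative value iteration."""
    n = len(P)
    h = [0.0] * n
    for _ in range(iters):
        q = [
            max(r + expected_next(dist, h) for r, dist in zip(R[s], P[s]))
            for s in range(n)
        ]
        # Re-center on state 0 so the iterates stay bounded.
        h = [v - q[0] for v in q]
    return max(h) - min(h)


if __name__ == "__main__":
    D = diameter(P)
    H = optimal_bias_span(P, R)
    print(f"D = {D:.4f}, H = {H:.4f}, H <= D: {H <= D}")
```

For this toy MDP the values can be checked by hand: reaching state 1 from state 0 takes 10 steps in expectation, so D = 10, while the optimal gain is 0.5 and the bias difference between the two states is 5, so H = 5, consistent with $H \leq D$.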
