Optimal Sample Complexity for Average Reward Markov Decision Processes

Shengbo Wang; Jose Blanchet; and Peter Glynn

arXiv:2310.08833·cs.LG·February 14, 2024·1 cites


TL;DR

This paper presents a new algorithm for policy learning in average reward Markov decision processes that achieves near-optimal sample complexity, closing the gap between existing upper and lower bounds.

Contribution

We develop the first estimator for average reward MDPs whose sample complexity matches the theoretical lower bound, improving on the previous upper bound by a factor of $t_{mix}$.

Findings

01

Our algorithm attains the optimal sample complexity of $\widetilde{O}(|S||A| t_{mix} \epsilon^{-2})$, matching the lower bound stated in the abstract.

02

Numerical experiments confirm the theoretical efficiency of the proposed method.

03

The approach bridges the gap between upper and lower bounds in sample complexity for average reward MDPs.

Abstract

We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $O(|S||A| t_{mix}^{2} \epsilon^{-2})$ and a lower bound of $\Omega(|S||A| t_{mix} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{mix}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{mix}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde…
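To make the size of the $t_{mix}$ gap concrete, the short sketch below evaluates the two scalings quoted in the abstract for one illustrative problem size. It is only an order-of-magnitude illustration: constants and logarithmic factors are dropped, and the function name `sample_bounds` and the example values are assumptions chosen for this illustration, not taken from the paper.

```python
def sample_bounds(n_states, n_actions, t_mix, eps):
    """Return (prior_upper, lower) sample-complexity scalings.

    Constants and log factors are ignored, so these are scalings only,
    not actual sample counts guaranteed by any algorithm.
    """
    base = n_states * n_actions / eps**2
    prior_upper = base * t_mix**2   # |S||A| t_mix^2 eps^-2 (earlier upper bound)
    lower = base * t_mix            # |S||A| t_mix eps^-2 (lower bound)
    return prior_upper, lower


# Hypothetical problem size, chosen only for illustration.
prior_upper, lower = sample_bounds(n_states=10, n_actions=5, t_mix=20, eps=0.1)

# The ratio of the two scalings is exactly the t_mix gap the abstract refers to.
gap = prior_upper / lower
print(f"prior upper bound scaling: {prior_upper:.3g}")
print(f"lower bound scaling:       {lower:.3g}")
print(f"gap factor:                {gap:.3g}")
```

Under these assumptions the ratio of the two scalings equals $t_{mix}$ regardless of $|S|$, $|A|$, or $\epsilon$, which is why slowly mixing chains stand to gain the most from closing the gap.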

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 2

Strengths

1. The paper addresses a significant issue in the domain of Markov Decision Processes, providing a solution to the sample complexity associated with maximizing the long-term average reward. This is a valuable contribution that could potentially advance understanding and application in this area.
2. By enhancing pre-existing results by a factor of the mixing time and aligning with the established lower bound, the paper provides a comparative analysis that underscores the improvements made and the

Weaknesses

1. The algorithm is primarily a synthesis of methodologies from Jin & Sidford (2021) and Li et al. (2020). While this approach has its merits, the novelty of this paper is somewhat limited given its dependence on previous works.
2. The paper could be enhanced by placing greater emphasis on the challenges addressed by the study and the innovative aspects of the proposed algorithm. Highlighting these elements would help to showcase the unique contributions of the paper and further establish its si

Reviewer 02 · Rating 8 · accept, good paper · Confidence 4

Strengths

- First minimax optimal guarantees for AMDPs under a generative model assumption;
- As a byproduct, the authors provide minimax optimal guarantees for ergodic DMDPs;
- Computationally feasible algorithm;
- Simplicity of the presented approach.

Weaknesses

- All the main instruments have already been introduced in other papers, and thus this paper may lack novelty.
- The reduction of AMDP to DMDP is presented in (Jin & Sidford, 2021);
- Optimal rates with optimal warm-up are presented in (Li et al. 2020);
- Rates for mixing DMDP are already presented in (Wang et al. 2023) (specifically, Proposition 6.1 and Corollary 6.2.1).

Yujia Jin and Aaron Sidford. Towards tight bounds on the sample complexity of average-reward MDPs, 2021. Gen Li, Yuting

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

1. Theoretical paper that gives a matching bound, therefore fully establishing the optimal sample complexity for AMDP under the uniform ergodicity condition.
2. The background was explained clearly, and the context of the result relative to other related settings is also well explained.
3. In most places, the notation and proofs are handled rigorously, more so than in the average paper.

Weaknesses

1. The result is somewhat thin in the sense that it feels like filling a small gap that was somehow overlooked by several previous groups of researchers, though I personally like the cleanness of the result.
2. The main contribution is technical, yet the main paper does not really spend the effort to clearly explain the technical critical point that enables the authors to establish the bound. Particularly, it appears Proposition A.1 is the critical step to establish the concentration inequality

Videos

Optimal Sample Complexity for Average Reward Markov Decision Processes · slideslive

Taxonomy

Topics: Bayesian Modeling and Causal Inference