Asymptotically optimal regret in communicating Markov decision processes

Victor Boone

arXiv:2505.18064·cs.LG·May 26, 2025

TL;DR

This paper introduces a learning algorithm for communicating Markov decision processes that achieves asymptotically optimal regret, K(M) log(T) + o(log(T)), by explicitly tracking the optimal constant K(M) while balancing exploration, co-exploration, and exploitation.

Contribution

The paper presents an algorithm that attains asymptotically optimal regret in average-reward communicating MDPs by explicitly tracking the constant K(M), and it handles the discontinuity of K(M) with a regularization mechanism.

Findings

01

Achieves regret of K(M) log(T) + o(log(T)) in communicating MDPs.

02

Develops a regularization mechanism to estimate K(M) with arbitrary precision from empirical data.

03

Demonstrates the discontinuity of the function K(M).

Abstract

In this paper, we present a learning algorithm that achieves asymptotically optimal regret for Markov decision processes in average reward under a communicating assumption. That is, given a communicating Markov decision process $M$, our algorithm has regret $K(M) \log(T) + o(\log(T))$, where $T$ is the number of learning steps and $K(M)$ is the best possible constant. This algorithm works by explicitly tracking the constant $K(M)$ to learn optimally, then balancing the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function $K(M)$ is discontinuous, which is a consequent challenge for our approach. To that end, we describe a regularization mechanism to estimate $K(M)$ with arbitrary precision from empirical data.
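To make the notion of regret concrete, here is a minimal sketch assuming the standard average-reward definition, regret(T) = T g*(M) minus the sum of collected rewards, where g*(M) is the optimal gain. The abstract does not spell out this definition, and the toy two-state MDP and the helper names `optimal_gain` and `regret` are hypothetical. The sketch illustrates the quantity being bounded, not the paper's algorithm, and it does not compute K(M).

```python
import numpy as np

def optimal_gain(P, r, tol=1e-9, max_iter=10_000):
    """Approximate the optimal gain g*(M) by undiscounted value iteration.

    P[s, a, s2] are transition probabilities and r[s, a] are mean rewards.
    Assumption: the iterates converge (true for the toy MDP below).
    """
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_next = (r + P @ v).max(axis=1)      # Bellman optimality update
        diff = v_next - v
        if diff.max() - diff.min() < tol:     # span of the increment is small
            return 0.5 * (diff.max() + diff.min())
        v = v_next
    raise RuntimeError("value iteration did not converge")

def regret(rewards, gain):
    """Regret after T = len(rewards) steps: T * g* minus collected reward."""
    return len(rewards) * gain - float(np.sum(rewards))

# Hypothetical communicating MDP: action 0 stays put, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
r = np.array([[0.5, 0.0],    # state 0: staying pays 0.5, switching pays 0
              [1.0, 0.0]])   # state 1: staying pays 1.0, switching pays 0

g_star = optimal_gain(P, r)

# A learner that lingers in state 0 for three steps, switches, then stays in state 1.
rewards = [0.5, 0.5, 0.5, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(g_star, regret(rewards, g_star))
```

In this toy run, all of the regret comes from the first four steps spent away from the optimal behavior. The paper's result concerns how slowly such losses can accumulate over time: at rate K(M) log(T), up to lower-order terms.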
