Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Daniil Tiapkin (CMAP, LMO), Evgenii Chzhen (CELESTE, LMO), Gilles Stoltz (LMO, CELESTE)

arXiv:2407.05704 · cs.LG · March 6, 2025

TL;DR

This paper introduces APO-MVP, a policy optimization algorithm for adversarial MDPs whose regret bound improves on the best-known one by a factor of $\sqrt{S}$, narrowing the gap with stochastic MDPs while remaining simple to implement.

Contribution

The paper presents APO-MVP, a policy optimization algorithm for adversarial MDPs whose regret bound is near-optimal in its dependence on $S$, $A$, and $T$, and whose design and analysis do not rely on occupancy measures.

Findings

01

Achieves a regret bound of order $\tilde{O}(\mathrm{poly}(H)\sqrt{SAT})$

02

Improves on the best-known regret bound by a factor of $\sqrt{S}$

03

Matches the minimax lower bound $\Omega(\sqrt{H^{3} S A T})$ in its dependence on $S$, $A$, and $T$

Abstract

We consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each of which consists of $H$ stages, and each episode is evaluated with respect to a reward function that is revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{O}(\mathrm{poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of $\sqrt{S}$, bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound $\Omega(\sqrt{H^{3} S A T})$ as far as the dependencies in $S, A, T$ are concerned. The proposed algorithm and analysis completely avoid the typical tool given by…
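To make the scaling claims concrete, here is a minimal numeric sketch in Python. It is not the APO-MVP algorithm. It only evaluates the leading terms of the rates quoted in the abstract, with the $\mathrm{poly}(H)$ and logarithmic factors dropped. The function names and the example values of $S$, $A$, $T$, and $H$ are illustrative assumptions, and the "previous" rate is simply the new rate multiplied by $\sqrt{S}$, as implied by the stated improvement.

```python
import math


def new_rate(S: int, A: int, T: int) -> float:
    """Leading term sqrt(S*A*T) of the stated bound (poly(H) and logs dropped)."""
    return math.sqrt(S * A * T)


def previous_rate(S: int, A: int, T: int) -> float:
    """Rate implied for the earlier best-known bound: worse by a factor sqrt(S)."""
    return math.sqrt(S) * new_rate(S, A, T)


def lower_bound_rate(H: int, S: int, A: int, T: int) -> float:
    """Leading term sqrt(H^3 * S * A * T) of the minimax lower bound."""
    return math.sqrt(H**3 * S * A * T)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    S, A, T, H = 100, 4, 10_000, 5
    print("new rate:        ", new_rate(S, A, T))
    print("previous rate:   ", previous_rate(S, A, T))
    print("improvement:     ", previous_rate(S, A, T) / new_rate(S, A, T))
    print("lower-bound rate:", lower_bound_rate(H, S, A, T))
```

In this sketch the ratio between the lower-bound rate and the new rate depends only on $H$, which is what "matching as far as the dependencies in $S, A, T$ are concerned" amounts to.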
