Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization
Daniil Tiapkin (CMAP, LMO), Evgenii Chzhen (CELESTE, LMO), Gilles Stoltz (LMO, CELESTE)

TL;DR
This paper introduces APO-MVP, a simple-to-implement policy optimization algorithm whose regret bound for adversarial MDPs narrows the gap with the bounds known for stochastic MDPs.
Contribution
The paper presents APO-MVP, a novel policy optimization algorithm that achieves near-optimal regret bounds in adversarial MDPs without relying on occupancy measures.
Findings
Achieves a regret bound of order \(\text{poly}(H)\sqrt{SAT}\)
Improves upon the best-known regret bound by a factor of \(\sqrt{S}\)
Matches the minimax lower bound as far as the dependencies in \(S, A, T\) are concerned
Abstract
We consider the problem of learning in adversarial Markov decision processes [MDPs] with an oblivious adversary in a full-information setting. The agent interacts with an environment during \(T\) episodes, each of which consists of \(H\) stages, and each episode is evaluated with respect to a reward function that will be revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order \(\text{poly}(H)\sqrt{SAT}\), where \(S\) and \(A\) are sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of \(\sqrt{S}\), bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound as far as the dependencies in \(S, A, T\) are concerned. The proposed algorithm and analysis completely avoid the typical tool given by…
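To make the setting concrete, here is a minimal Python sketch of how regret against the best fixed policy in hindsight could be computed in a small tabular episodic MDP with full-information rewards. It is not APO-MVP. It assumes a known, stage-independent transition kernel `P[s][a][s']`, a fixed initial state, and deterministic Markov policies `pi[h][s]`. All names (`evaluate`, `plan`, `regret`) are hypothetical and chosen for illustration only.

```python
import random

def evaluate(P, r, pi, H):
    """Value of deterministic policy pi[h][s] under reward r[h][s][a], by backward induction."""
    S = len(P)
    V = [0.0] * S
    for h in reversed(range(H)):
        V = [r[h][s][pi[h][s]] + sum(p * v for p, v in zip(P[s][pi[h][s]], V))
             for s in range(S)]
    return V

def plan(P, r, H):
    """Optimal values and a greedy policy for reward r, by dynamic programming."""
    S, A = len(P), len(P[0])
    V = [0.0] * S
    pi = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        Q = [[r[h][s][a] + sum(p * v for p, v in zip(P[s][a], V)) for a in range(A)]
             for s in range(S)]
        pi[h] = [max(range(A), key=lambda a, s=s: Q[s][a]) for s in range(S)]
        V = [max(Q[s]) for s in range(S)]
    return V, pi

def regret(P, rewards, policies, H, s0=0):
    """Best fixed policy in hindsight minus the learner's cumulative expected reward."""
    T, S, A = len(rewards), len(P), len(P[0])
    # For a fixed policy the value is linear in the reward function, so the best
    # fixed policy over T episodes is the optimal policy for the summed rewards.
    summed = [[[sum(rewards[t][h][s][a] for t in range(T)) for a in range(A)]
               for s in range(S)] for h in range(H)]
    best_total, _ = plan(P, summed, H)
    learner_total = sum(evaluate(P, rewards[t], policies[t], H)[s0] for t in range(T))
    return best_total[s0] - learner_total

# Toy instance (assumed values): 2 states, 2 actions, H = 3 stages, T = 5 episodes.
rng = random.Random(0)
S, A, H, T = 2, 2, 3, 5
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
rewards = [[[[rng.random() for _ in range(A)] for _ in range(S)] for _ in range(H)]
           for _ in range(T)]
always_zero = [[0] * S for _ in range(H)]  # baseline learner: always play action 0
print(regret(P, rewards, [always_zero] * T, H))
```

The rewards are drawn before any policy is chosen, which mirrors an oblivious adversary. A learning algorithm would replace the `always_zero` baseline with a sequence of policies, each chosen using only the rewards revealed in earlier episodes.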
