Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games

Wenhao Zhan; Jason D. Lee; Zhuoran Yang

arXiv:2206.01588 · cs.LG · June 6, 2022

TL;DR

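This paper introduces DORIS, a decentralized optimistic hyperpolicy mirror descent algorithm for a single agent facing nonstationary, possibly adversarial opponents in Markov games. DORIS is no-regret when the opponents' previous policies are revealed to the agent, and when every agent runs it, the agents' mixture policy forms an approximate coarse correlated equilibrium.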

Contribution

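The paper proposes DORIS, an algorithm for decentralized no-regret learning in Markov games with general function approximation. It proves a $\sqrt{K}$ regret bound over $K$ episodes and shows that the agents' mixture policy is an approximate coarse correlated equilibrium when all agents adopt DORIS.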

Findings

01 Achieves $\sqrt{K}$-regret, where $K$ is the number of episodes, under general function approximation.

02 When all agents run DORIS, their mixture policy forms an approximate coarse correlated equilibrium.

03 Applies to constrained and vector-valued MDPs, which can be modeled as zero-sum Markov games.

Abstract

We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result (Liu et al., 2022), we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS), which achieves $\sqrt{K}$-regret in the context of general function approximation,…
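To make the no-regret idea concrete, here is a minimal Python sketch of exponential weights (mirror descent with an entropy regularizer) over a finite set of candidate policies. It is not DORIS: it assumes a one-state game in which each candidate policy is a single action, exact values instead of optimistic estimates from function approximation, and an opponent whose action is revealed after each round. The game (`PAYOFF`), the function names, and the step size are all hypothetical choices made for illustration.

```python
import math
import random

# Hypothetical toy game: rock-paper-scissors with payoffs rescaled to [0, 1].
# PAYOFF[a][b] is the learner's payoff for action a against opponent action b.
PAYOFF = [[0.5 if a == b else (1.0 if (a - b) % 3 == 1 else 0.0)
           for b in range(3)] for a in range(3)]


def hedge_update(weights, values, eta):
    """One exponential-weights step: reweight each candidate by exp(eta * value)."""
    scaled = [w * math.exp(eta * v) for w, v in zip(weights, values)]
    total = sum(scaled)
    return [s / total for s in scaled]


def run_hedge(payoff, opponent_actions, eta):
    """Play against a revealed opponent sequence; return final weights and regret."""
    n = len(payoff)
    weights = [1.0 / n] * n          # uniform distribution over candidates
    learner_value = 0.0              # expected payoff collected by the learner
    cumulative = [0.0] * n           # payoff each fixed candidate would have earned
    for b in opponent_actions:
        # The opponent's choice is revealed, so every candidate can be evaluated.
        values = [payoff[a][b] for a in range(n)]
        learner_value += sum(w * v for w, v in zip(weights, values))
        for a in range(n):
            cumulative[a] += values[a]
        weights = hedge_update(weights, values, eta)
    # Regret against the best fixed candidate in hindsight.
    regret = max(cumulative) - learner_value
    return weights, regret


# A nonstationary opponent: mostly rock at first, then mostly paper.
rng = random.Random(0)
sequence = ([rng.choice([0, 0, 0, 1]) for _ in range(500)]
            + [rng.choice([1, 1, 1, 2]) for _ in range(500)])
eta = math.sqrt(8 * math.log(3) / len(sequence))
final_weights, regret = run_hedge(PAYOFF, sequence, eta)
print("regret:", regret, "bound:", math.sqrt(len(sequence) * math.log(3) / 2))
```

With payoffs in $[0, 1]$ and this step size, the standard exponential-weights analysis bounds the regret by $\sqrt{K \ln(n) / 2}$ for $n$ candidates over $K$ rounds, which is the same $\sqrt{K}$ scaling the abstract refers to, although the paper's setting (multi-step Markov games with function approximation) is considerably harder than this toy case.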
