Provable Self-Play Algorithms for Competitive Reinforcement Learning

Yu Bai; Chi Jin

arXiv:2002.04017 · cs.LG · July 10, 2020 · 30 cites

TL;DR

This paper introduces provably sample-efficient self-play algorithms for competitive reinforcement learning in Markov games, providing theoretical guarantees on regret bounds and addressing exploration challenges.

Contribution

It presents the first provably sample-efficient self-play algorithms with regret guarantees in competitive RL, including a polynomial-time explore-then-exploit method.

Findings

01

VI-ULCB achieves $\tilde{O}(\sqrt{T})$ regret against adversarial opponents.

02

An explore-then-exploit algorithm attains $\tilde{O}(T^{2/3})$ regret with polynomial runtime.

03

This work is the first to provide theoretical guarantees for self-play in competitive RL.
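The $T^{2/3}$ rate in the second finding is typical of explore-then-exploit schemes. The sketch below illustrates the usual heuristic behind such rates; it is not the paper's analysis. It assumes (hypothetically) that each exploration step can cost one unit of regret and that a policy learned from $n$ exploration steps is about $1/\sqrt{n}$ suboptimal per step. The function names and the cost model are illustrative assumptions, with all constants and logarithmic factors dropped.

```python
import math


def explore_then_exploit_bound(total_steps, n_explore):
    """Heuristic regret proxy for an explore-then-exploit schedule.

    Assumptions (not taken from the paper): exploration costs up to 1 per
    step, and the exploited policy is 1/sqrt(n_explore) suboptimal per step.
    """
    explore_cost = n_explore
    exploit_cost = (total_steps - n_explore) / math.sqrt(n_explore)
    return explore_cost + exploit_cost


def best_exploration_length(total_steps):
    """Brute-force the exploration length that minimizes the proxy."""
    return min(
        range(1, total_steps),
        key=lambda n: explore_then_exploit_bound(total_steps, n),
    )


if __name__ == "__main__":
    for total_steps in (10_000, 100_000, 1_000_000):
        n_star = best_exploration_length(total_steps)
        bound = explore_then_exploit_bound(total_steps, n_star)
        scale = total_steps ** (2 / 3)
        # Both ratios stay roughly constant as T grows, which is the
        # T^(2/3) scaling of the optimal exploration length and the proxy.
        print(total_steps, n_star / scale, bound / scale)
```

Under this cost model, balancing the two terms puts both the best exploration length and the resulting bound on the order of $T^{2/3}$, which is larger than the $\sqrt{T}$ rate reported for VI-ULCB.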

Abstract

Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment; it remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff. We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm, Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret $\tilde{O}(\sqrt{T})$ after playing $T$ steps of the game, where…
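To make the competitive setting concrete, here is a minimal sketch of how a strategy can be scored against an opponent that best-responds to it. It uses a one-shot zero-sum matrix game (rock-paper-scissors) as a stand-in for a single state of a Markov game. The payoff matrix and the function name `worst_case_payoff` are illustrative assumptions, and this is a one-shot analogue rather than the paper's regret definition.

```python
# Row player's payoffs for rock-paper-scissors. The game is zero-sum, so the
# column player receives the negative of each entry.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def worst_case_payoff(row_strategy, payoff=PAYOFF):
    """Expected payoff of a mixed row strategy against a best-responding
    column player, i.e. the column that minimizes the row player's payoff."""
    n_cols = len(payoff[0])
    column_payoffs = [
        sum(prob * payoff[i][j] for i, prob in enumerate(row_strategy))
        for j in range(n_cols)
    ]
    return min(column_payoffs)


if __name__ == "__main__":
    uniform = [1 / 3, 1 / 3, 1 / 3]
    always_rock = [1.0, 0.0, 0.0]
    # The value of rock-paper-scissors is 0, so the gap between 0 and the
    # worst-case payoff measures how exploitable a strategy is.
    print("uniform:", worst_case_payoff(uniform))
    print("always rock:", worst_case_payoff(always_rock))
```

The uniform strategy cannot be exploited, while a deterministic strategy loses every round to the matching counter-move. Accumulating a gap of this kind over many rounds gives a rough intuition for regret measured against an adversarial opponent.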

Videos

Provable Self-Play Algorithms for Competitive Reinforcement Learning · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Scheduling and Optimization Algorithms · Advanced Bandit Algorithms Research