Provable Self-Play Algorithms for Competitive Reinforcement Learning
Yu Bai, Chi Jin

TL;DR
This paper introduces provably sample-efficient self-play algorithms for competitive reinforcement learning in Markov games, with regret guarantees that account for the exploration/exploitation tradeoff.
Contribution
It presents the first provably sample-efficient self-play algorithms with regret guarantees in competitive RL, including a polynomial-time explore-then-exploit method.
Findings
VI-ULCB achieves $\tilde{O}(\sqrt{T})$ regret against adversarial opponents.
An explore-then-exploit algorithm attains $\tilde{O}(T^{2/3})$ regret with polynomial runtime.
This work is the first to provide theoretical guarantees for self-play in competitive RL.
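To make "regret against an adversarial opponent" more concrete, here is a minimal sketch under strong simplifying assumptions: a single-state, one-step zero-sum game (a matrix game) rather than a full Markov game, with the gap between the two best-response values used as the per-round loss. The function name `best_response_gap` and the matching-pennies payoff matrix are illustrative choices, not taken from the paper, and this is not the VI-ULCB algorithm itself.

```python
def best_response_gap(payoff, mu, nu):
    """Duality gap of mixed strategies (mu, nu) in a zero-sum matrix game.

    payoff[i][j] is the reward to the max-player (rows) when row i meets
    column j; the min-player (columns) receives the negative of it.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Value of each pure row when the column player keeps playing nu.
    row_values = [
        sum(payoff[i][j] * nu[j] for j in range(n_cols)) for i in range(n_rows)
    ]
    # Value of each pure column when the row player keeps playing mu.
    col_values = [
        sum(mu[i] * payoff[i][j] for i in range(n_rows)) for j in range(n_cols)
    ]
    # Best response to nu minus best response to mu; zero at a Nash equilibrium.
    return max(row_values) - min(col_values)


# Hypothetical example game: matching pennies.
MATCHING_PENNIES = [[1.0, -1.0], [-1.0, 1.0]]

gap_at_equilibrium = best_response_gap(MATCHING_PENNIES, [0.5, 0.5], [0.5, 0.5])
gap_pure = best_response_gap(MATCHING_PENNIES, [1.0, 0.0], [1.0, 0.0])
```

In this toy setting, summing such gaps over rounds gives a regret-like quantity: it stays at zero when both players use the uniform equilibrium strategy and grows linearly when they commit to exploitable pure strategies.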
Abstract
Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment; it remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff. We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm, Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret $\tilde{O}(\sqrt{T})$ after playing $T$ steps of the game, where…
Taxonomy
Topics: Reinforcement Learning in Robotics · Scheduling and Optimization Algorithms · Advanced Bandit Algorithms Research
