A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
Qinghua Liu, Tiancheng Yu, Yu Bai, Chi Jin

TL;DR
This paper provides a sharp analysis and improved sample complexity guarantees for model-based self-play algorithms in multi-agent Markov games, achieving near-optimal bounds and practical policy outputs.
Contribution
It introduces the Optimistic Nash Value Iteration (Nash-VI) algorithm for two-player zero-sum Markov games, with an improved sample complexity that matches the information-theoretic lower bound up to a small factor.
Findings
Achieves $\tilde{O}(H^3SAB/\epsilon^2)$ sample complexity
Improves over previous $\tilde{O}(H^4S^2AB/\epsilon^2)$ guarantees
First to match the information-theoretic lower bound up to a $\min\{A,B\}$ factor
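As a rough illustration of the gap between the two guarantees above, the sketch below evaluates only the leading-order terms $H^3SAB/\epsilon^2$ and $H^4S^2AB/\epsilon^2$. Constants and logarithmic factors hidden by $\tilde{O}$ are dropped, so the outputs are scaling estimates rather than actual episode counts. The function names and the example problem sizes are hypothetical and not taken from the paper.

```python
def nash_vi_episodes(H, S, A, B, eps):
    """Leading-order term H^3 * S * A * B / eps^2 (constants and logs dropped)."""
    return H**3 * S * A * B / eps**2


def prior_model_based_episodes(H, S, A, B, eps):
    """Leading-order term H^4 * S^2 * A * B / eps^2 (constants and logs dropped)."""
    return H**4 * S**2 * A * B / eps**2


def improvement_factor(H, S, A, B, eps):
    """Ratio of the prior bound to the new bound; algebraically this is H * S."""
    return prior_model_based_episodes(H, S, A, B, eps) / nash_vi_episodes(H, S, A, B, eps)


# Hypothetical problem size: horizon 10, 100 states, 5 actions per player, eps = 0.5.
example = dict(H=10, S=100, A=5, B=5, eps=0.5)
new_bound = nash_vi_episodes(**example)
old_bound = prior_model_based_episodes(**example)
factor = improvement_factor(**example)
```

The ratio does not depend on $A$, $B$, or $\epsilon$: the improvement is a factor of $HS$, so it grows with both the horizon and the number of states.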
Abstract
Model-based algorithms -- algorithms that explore the environment through building and utilizing an estimated model -- are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) -- for two-player zero-sum Markov games that is able to output an $\epsilon$-approximate Nash policy in $\tilde{O}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of…
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Scheduling and Optimization Algorithms
