Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning
Yulai Zhao, Zhuoran Yang, Zhaoran Wang, Jason D. Lee

TL;DR
This paper introduces a provably convergent multi-agent PPO algorithm that leverages local optimization to achieve global optimality in cooperative Markov games, supported by theoretical guarantees and experimental validation.
Contribution
It presents the first provably convergent multi-agent PPO algorithm for cooperative Markov games and extends it to the off-policy setting, where pessimism is introduced into policy evaluation.
Findings
The algorithm converges to the globally optimal policy at a sublinear rate
The off-policy extension introduces pessimism into policy evaluation, which aligns with experiments
First provably convergent multi-agent PPO algorithm in cooperative Markov games
Abstract
Policy optimization methods with function approximation are widely used in multi-agent reinforcement learning. However, it remains elusive how to design such algorithms with statistical guarantees. Leveraging a multi-agent performance difference lemma that characterizes the landscape of multi-agent policy optimization, we find that the localized action value function serves as an ideal descent direction for each local policy. Motivated by the observation, we present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO. We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate. We extend our algorithm to the off-policy setting and introduce pessimism to policy evaluation, which aligns with experiments. To our knowledge,…
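The local update described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a single-state, two-agent cooperative game with a hypothetical reward matrix, exact (not estimated) localized action values, a tabular KL-regularized PPO-style step in closed form, and a second agent whose policy conditions on the first agent's action. The names `local_q_values`, `kl_regularized_step`, `run`, and the step size `eta` are illustrative.

```python
import numpy as np

def local_q_values(reward, log_pi2):
    """Localized action values for a one-state, two-agent cooperative game."""
    pi2 = np.exp(log_pi2)
    # Agent 1: average the shared reward over agent 2's action under pi2(.|a1).
    q1 = (pi2 * reward).sum(axis=1)
    # Agent 2 conditions on a1, so its localized value is the joint reward.
    q12 = reward
    return q1, q12

def kl_regularized_step(log_pi, q, eta):
    """Closed-form maximizer of <q, p> - (1/eta) * KL(p || pi), in log space."""
    logits = log_pi + eta * q
    m = logits.max(axis=-1, keepdims=True)
    log_norm = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    return logits - log_norm

def run(reward, eta=1.0, iters=100):
    n1, n2 = reward.shape
    log_pi1 = np.full(n1, -np.log(n1))        # uniform pi1(a1)
    log_pi2 = np.full((n1, n2), -np.log(n2))  # uniform pi2(a2 | a1)
    for _ in range(iters):
        # Both agents use values computed from the current joint policy.
        q1, q12 = local_q_values(reward, log_pi2)
        log_pi1 = kl_regularized_step(log_pi1, q1, eta)
        log_pi2 = kl_regularized_step(log_pi2, q12, eta)
    pi1, pi2 = np.exp(log_pi1), np.exp(log_pi2)
    value = float(pi1 @ (pi2 * reward).sum(axis=1))
    return pi1, pi2, value

# Hypothetical shared reward: rows index agent 1's action, columns agent 2's.
reward = np.array([[1.0, 0.0, 0.5],
                   [0.0, 2.0, 0.3]])
pi1, pi2, value = run(reward)
print(pi1, pi2[pi1.argmax()], value)
```

In this toy setting each agent only ever ascends its own localized value, yet the joint policy concentrates on the reward-maximizing action pair. The sketch leaves out function approximation, state transitions, and the pessimistic off-policy evaluation mentioned in the abstract.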
Taxonomy
Topics: Reinforcement Learning in Robotics · Game Theory and Applications · Distributed Control Multi-Agent Systems
Methods: Entropy Regularization · Proximal Policy Optimization
