Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, Yaodong Yang

TL;DR
This paper extends trust region methods to multi-agent reinforcement learning, proposing new algorithms that ensure monotonic policy improvement without requiring shared parameters, and demonstrates their superior performance on complex tasks.
Contribution
The paper introduces HATRPO and HAPPO algorithms with theoretical guarantees of monotonic improvement in MARL, without needing shared parameters or restrictive assumptions.
Findings
HATRPO and HAPPO outperform existing algorithms on multiple benchmarks.
Theoretical proof of monotonic improvement for the proposed algorithms.
Achieved state-of-the-art results on Multi-Agent MuJoCo and StarCraft II tasks.
Abstract
Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply; this is because agents, even in cooperative games, could have conflicting directions of policy updates. As a result, achieving a guaranteed improvement on the joint policy where each agent acts individually remains an open challenge. In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike…
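The abstract names the multi-agent advantage decomposition lemma without stating it. As a rough illustration only, the sketch below assumes the usual reading of that lemma: the joint advantage equals the sum of per-agent advantages, where each agent's advantage is conditioned on the actions already chosen by the agents before it in the update order. The one-step, two-agent game, the reward table, the policies, and the function name `decompose` are all hypothetical and are not taken from the paper.

```python
# Hypothetical one-step cooperative game: two agents, two actions each,
# one shared reward. All numbers are made up for illustration.
REWARD = [[1.0, 0.0],
          [0.0, 2.0]]   # REWARD[a1][a2]
PI1 = [0.5, 0.5]        # agent 1's policy over its two actions
PI2 = [0.25, 0.75]      # agent 2's policy over its two actions


def decompose(reward, pi1, pi2):
    """Split the joint advantage for the agent order (1, 2).

    Returns (adv_joint, adv_1, adv_2_given_1), where
    adv_joint[a1][a2] should equal adv_1[a1] + adv_2_given_1[a1][a2].
    """
    n1, n2 = len(pi1), len(pi2)
    # Q-value of agent 1 alone: agent 2's action is averaged out under pi2.
    q1 = [sum(pi2[b] * reward[a][b] for b in range(n2)) for a in range(n1)]
    # Value of the (single) state under the joint policy.
    v = sum(pi1[a] * q1[a] for a in range(n1))
    # Joint advantage A(a1, a2) = Q(a1, a2) - V.
    adv_joint = [[reward[a][b] - v for b in range(n2)] for a in range(n1)]
    # Advantage of agent 1 acting first: A^1(a1) = Q^1(a1) - V.
    adv_1 = [q1[a] - v for a in range(n1)]
    # Advantage of agent 2 given agent 1's action: Q(a1, a2) - Q^1(a1).
    adv_2_given_1 = [[reward[a][b] - q1[a] for b in range(n2)]
                     for a in range(n1)]
    return adv_joint, adv_1, adv_2_given_1


if __name__ == "__main__":
    adv_joint, adv_1, adv_2_given_1 = decompose(REWARD, PI1, PI2)
    for a in range(2):
        for b in range(2):
            total = adv_1[a] + adv_2_given_1[a][b]
            print(a, b, adv_joint[a][b], total)
```

In this toy setting the identity is a telescoping sum, so it holds for any reward table and any pair of policies. Under the same assumption, it suggests why a sequential update can help: if each agent in turn picks an update with a non-negative conditional advantage, the summed joint advantage is also non-negative.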
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Advanced Memory and Neural Computing
Methods: Convolution · Adam · Batch Normalization · Experience Replay · Dense Connections · Weight Decay · MADDPG
