Efficient Multi-agent Reinforcement Learning by Planning
Qihan Liu, Jianing Ye, Xiaoteng Ma, Jun Yang, Bin Liang, Chongjie Zhang

TL;DR
This paper introduces MAZero, a model-based multi-agent reinforcement learning algorithm that combines planning and search techniques to improve sample efficiency and performance in large-scale decision-making tasks.
Contribution
The paper proposes MAZero, integrating a centralized model with Monte Carlo Tree Search and novel techniques for efficient multi-agent planning, advancing model-based MARL methods.
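The need for efficient multi-agent planning can be illustrated with a little arithmetic: the joint action space grows exponentially with the number of agents, so a tree search cannot enumerate every joint action at each node. The following is a minimal sketch, not the paper's implementation. It assumes every agent has the same number of discrete actions and that joint actions are drawn by sampling each agent independently from its own policy. The function names and the numbers are hypothetical and chosen only for illustration.

```python
import random


def joint_action_count(n_agents, actions_per_agent):
    """Size of the joint action space when every agent has the same discrete action set."""
    return actions_per_agent ** n_agents


def sample_joint_actions(per_agent_probs, k, seed=0):
    """Draw k joint actions by sampling each agent independently from its own policy.

    per_agent_probs is a list with one probability vector per agent
    (an illustrative stand-in for per-agent policy outputs).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(k):
        joint = tuple(
            rng.choices(range(len(probs)), weights=probs, k=1)[0]
            for probs in per_agent_probs
        )
        samples.append(joint)
    return samples


if __name__ == "__main__":
    # Exponential growth of the joint action space with the number of agents.
    for n in (2, 5, 10):
        print(n, "agents ->", joint_action_count(n, 10), "joint actions")

    # A search node can expand a small sampled subset instead of all joint actions.
    policies = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
    print(sample_joint_actions(policies, k=4))
```

Sampling a fixed number of joint actions keeps the branching factor of the search constant regardless of how many agents there are, at the cost of possibly missing good joint actions that were never sampled.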
Findings
MAZero outperforms model-free methods in sample efficiency.
MAZero achieves comparable or better performance than existing model-based approaches.
The approach demonstrates effectiveness on the SMAC benchmark.
Abstract
Multi-agent reinforcement learning (MARL) algorithms have accomplished remarkable breakthroughs in solving large-scale decision-making tasks. Nonetheless, most existing MARL algorithms are model-free, limiting sample efficiency and hindering their applicability in more challenging scenarios. In contrast, model-based reinforcement learning (MBRL), particularly algorithms integrating planning, such as MuZero, has demonstrated superhuman performance with limited data in many tasks. Hence, we aim to boost the sample efficiency of MARL by adopting model-based approaches. However, incorporating planning and search methods into multi-agent systems poses significant challenges. The expansive action space of multi-agent systems often necessitates leveraging the nearly-independent property of agents to accelerate learning. To tackle this issue, we propose the MAZero algorithm, which combines a…
Peer Reviews
Decision: ICLR 2024 poster
The paper is a logical extension of MuZero to the multi-agent case, and builds on those ideas to create six neural network functions that underlie the model. The writing is clear and the various cost functions and parameters are well laid out. The experiments push the MARL problem in terms of action space complexity, and show that the MCTS method is effective in providing good performance with reduced search. The primary contribution of the paper is to use prediction along with the reduced
Ultimately, the method gains in learning efficiency for the game studied, which is an important contribution, although it isn’t clear that there is any performance gain compared to other CTDE methods. The method seems to require global reward information at each agent during execution.
* MAZero is the first empirically effective approach that extends the MuZero paradigm into multi-agent cooperative environments.
* The proposed OS(λ) and AWPO techniques improve search efficiency in large action spaces.
* Extensive experiments on the SMAC benchmark demonstrate the effectiveness of MAZero in terms of sample efficiency and performance.
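The reviews name AWPO but this summary does not give its formula. As a rough illustration of the general idea behind advantage-weighted policy updates, the sketch below weights each sampled action's negative log-likelihood by the exponential of its advantage. This is an assumption based on the generic advantage-weighted form, not the paper's exact objective. The function name, the temperature `alpha`, and the weight normalization are all hypothetical choices made for this example.

```python
import math


def advantage_weighted_loss(log_probs, advantages, alpha=1.0):
    """Weighted negative log-likelihood over a set of sampled actions.

    Actions with a higher advantage get an exponentially larger weight.
    Weights are normalized to sum to 1 (an assumption made here for
    numerical convenience, not something stated in the source).
    """
    max_adv = max(advantages)  # subtract the max for numerical stability
    weights = [math.exp((a - max_adv) / alpha) for a in advantages]
    total = sum(weights)
    weights = [w / total for w in weights]
    return -sum(w * lp for w, lp in zip(weights, log_probs))


if __name__ == "__main__":
    log_probs = [math.log(0.5), math.log(0.25)]
    # Equal advantages: both actions contribute equally to the loss.
    print(advantage_weighted_loss(log_probs, [0.0, 0.0]))
    # The first action looks much better: the loss is dominated by its log-probability.
    print(advantage_weighted_loss(log_probs, [10.0, 0.0]))
```

A smaller `alpha` concentrates the weight on the best-looking sampled action, while a larger `alpha` moves the loss toward a plain average over the samples.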
* The paper focuses on deterministic environments, and it is unclear how well MAZero would perform in stochastic environments.
* The proposed techniques may not be applicable to all types of multi-agent environments, and further research is needed to generalize the approach.
The paper is in general well-written, and clear, with thorough appendices for the details. The results given show that this model-based approach is not only more sample efficient than model-free approaches in the MARL setting but also, importantly, tractable when the Optimistic Search Lambda Algorithm and Weighted Policy Optimization are used within the tree-search. While these results are not surprising, such an approach has not been taken before and so there is definitely originality and sig
The major issue comes down to the single, very specific domain that this has been tested on. While the results are, as stated above, impressive, they are only impressive in this single domain, and it would not seem difficult to show that they are just as significant in other domains with different types of action and state spaces (continuous, discrete, visual, tabular). In addition, I believe that it should become standard within the community to utilise the evaluation protocol of Gorsanne et
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Residual Connection · Average Pooling · Convolution · Batch Normalization · Residual Block · Prioritized Experience Replay · Monte-Carlo Tree Search · MuZero
