Efficient Multi-agent Reinforcement Learning by Planning

Qihan Liu; Jianing Ye; Xiaoteng Ma; Jun Yang; Bin Liang; Chongjie Zhang

arXiv:2405.11778 · cs.LG · May 21, 2024 · 1 citation

TL;DR

This paper introduces MAZero, a model-based multi-agent reinforcement learning algorithm that combines planning and search techniques to improve sample efficiency and performance in large-scale decision-making tasks.

Contribution

The paper proposes MAZero, integrating a centralized model with Monte Carlo Tree Search and novel techniques for efficient multi-agent planning, advancing model-based MARL methods.
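As a rough illustration of how a MuZero-style tree search picks which action to expand, the sketch below implements a simplified pUCT selection rule with a constant exploration coefficient. The function name `puct_select`, its arguments, and the default `c` are hypothetical and chosen for illustration; this is not taken from the MAZero code, and the paper's own search procedure may differ in its details.

```python
import math

def puct_select(q_values, priors, visit_counts, c=1.25):
    """Return the index of the child with the highest simplified pUCT score.

    score(a) = Q(a) + c * P(a) * sqrt(sum_b N(b)) / (1 + N(a))

    Assumption: a constant exploration coefficient c, which simplifies
    the visit-dependent coefficient used in MuZero.
    """
    total_visits = sum(visit_counts)

    def score(i):
        exploration = c * priors[i] * math.sqrt(total_visits) / (1 + visit_counts[i])
        return q_values[i] + exploration

    # Ties are broken in favour of the lowest index.
    return max(range(len(q_values)), key=score)
```

With equal value estimates and priors, the rule favours the less-visited child; with equal visit counts and priors, it favours the child with the higher value estimate.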

Findings

01. MAZero outperforms model-free methods in sample efficiency.

02. MAZero achieves comparable or better performance than existing model-based approaches.

03. The approach demonstrates effectiveness on the SMAC benchmark.

Abstract

Multi-agent reinforcement learning (MARL) algorithms have accomplished remarkable breakthroughs in solving large-scale decision-making tasks. Nonetheless, most existing MARL algorithms are model-free, limiting sample efficiency and hindering their applicability in more challenging scenarios. In contrast, model-based reinforcement learning (MBRL), particularly algorithms integrating planning, such as MuZero, has demonstrated superhuman performance with limited data in many tasks. Hence, we aim to boost the sample efficiency of MARL by adopting model-based approaches. However, incorporating planning and search methods into multi-agent systems poses significant challenges. The expansive action space of multi-agent systems often necessitates leveraging the nearly-independent property of agents to accelerate learning. To tackle this issue, we propose the MAZero algorithm, which combines a…
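To make the "expansive action space" concrete: with n agents that each choose among m actions, the joint action space has m^n elements, so exhaustive expansion in a search tree quickly becomes infeasible. The sketch below computes that size and shows one generic way to limit the branching factor, namely drawing a handful of joint actions from independent per-agent distributions. The function names and the sampling scheme are illustrative assumptions, not a description of MAZero's actual procedure.

```python
import random

def joint_action_space_size(num_agents, actions_per_agent):
    """Number of joint actions when every agent has the same action count."""
    return actions_per_agent ** num_agents

def sample_joint_actions(per_agent_probs, k, rng):
    """Draw up to k distinct joint actions from factored per-agent policies.

    per_agent_probs: one probability list per agent (hypothetical input).
    Each agent's action is sampled independently, which is an assumption
    that relies on the agents being nearly independent.
    """
    samples = set()
    for _ in range(k):
        joint = tuple(
            rng.choices(range(len(probs)), weights=probs, k=1)[0]
            for probs in per_agent_probs
        )
        samples.add(joint)  # duplicates collapse, so fewer than k may remain
    return sorted(samples)

# Example: 8 agents with 10 actions each already give 10**8 joint actions,
# while a search node could expand only a few sampled candidates.
size = joint_action_space_size(8, 10)
candidates = sample_joint_actions([[0.25] * 4] * 3, k=5, rng=random.Random(0))
```

The sampled candidates are a small subset of the full joint space, which is the kind of reduction a planner needs before tree search over joint actions becomes tractable.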

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

The paper is a logical extension of MuZero to the multi-agent case, and builds on those ideas to create six neural network functions that underlie the model. The writing is clear and the various cost functions and parameters are well laid out. The experiments push the MARL problem in terms of action space complexity, and show that the MCTS method is effective in providing good performance with reduced search. The primary contribution of the paper is to use prediction along with the reduced

Weaknesses

Ultimately, the method gains in learning efficiency for the game studied, which is an important contribution, although it isn’t clear that there is any performance gain compared to other CTDE methods. The method seems to require global reward information at each agent during execution.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

* MAZero is the first empirically effective approach that extends the MuZero paradigm into multi-agent cooperative environments.
* The proposed OS(λ) and AWPO techniques improve search efficiency in large action spaces.
* Extensive experiments on the SMAC benchmark demonstrate the effectiveness of MAZero in terms of sample efficiency and performance.

Weaknesses

* The paper focuses on deterministic environments, and it is unclear how well MAZero would perform in stochastic environments.
* The proposed techniques may not be applicable to all types of multi-agent environments, and further research is needed to generalize the approach.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

The paper is in general well-written, and clear, with thorough appendices for the details. The results given show that this model-based approach is not only more sample efficient than model-free approaches in the MARL setting but also, importantly, tractable when the Optimistic Search Lambda Algorithm and Weighted Policy Optimization are used within the tree-search. While these results are not surprising, such an approach has not been taken before and so there is definitely originality and sig

Weaknesses

The major issue comes down to the single, very specific domain that this has been tested on. While the results are, as stated above, impressive, they are only impressive in this single domain, and it would not seem difficult to show that they are just as significant in other domains with different types of action and state spaces (continuous, discrete, visual, tabular). In addition, I believe that it should become standard within the community to utilise the evaluation protocol of Gorsane et

Code & Models

Repositories

liuqh16/mazero · PyTorch · Official

Videos

Efficient Multi-agent Reinforcement Learning by Planning · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Residual Connection · Average Pooling · Convolution · Batch Normalization · Residual Block · Prioritized Experience Replay · Monte-Carlo Tree Search · MuZero