MARPO: A Reflective Policy Optimization for Multi Agent Reinforcement Learning
Cuiling Wu, Yaozhong Gan, Junliang Xing, Ying Fu

TL;DR
MARPO is a reflective policy optimization method for multi-agent reinforcement learning. A reflection mechanism that leverages subsequent trajectories improves sample efficiency, an asymmetric clipping range derived from the KL divergence and adjusted dynamically improves training stability, and the method outperforms the compared methods in classic multi-agent environments.
Contribution
The paper presents MARPO, a multi-agent reinforcement learning algorithm that combines a reflection mechanism with adaptive, asymmetric clipping to improve sample efficiency and training stability.
Findings
MARPO outperforms existing methods in classic multi-agent environments.
The reflection mechanism, which leverages subsequent trajectories, improves sample efficiency.
Dynamic, asymmetric clipping derived from the KL divergence enhances training stability.
Abstract
We propose Multi Agent Reflective Policy Optimization (MARPO) to alleviate the issue of sample inefficiency in multi-agent reinforcement learning. MARPO consists of two key components: a reflection mechanism that leverages subsequent trajectories to enhance sample efficiency, and an asymmetric clipping mechanism that is derived from the KL divergence and dynamically adjusts the clipping range to improve training stability. We evaluate MARPO in classic multi-agent environments, where it consistently outperforms other methods.
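To make the idea of an asymmetric clipping range concrete, the following is a minimal sketch of a PPO-style clipped surrogate whose lower and upper bounds differ. It is not MARPO's actual objective: the abstract does not give the formula, so the function name `asymmetric_clipped_surrogate`, the parameters `eps_low` and `eps_high`, and their fixed default values are all hypothetical. In MARPO the range is reportedly derived from the KL divergence and adjusted dynamically, which this sketch does not attempt to reproduce.

```python
def asymmetric_clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """Per-sample PPO-style surrogate with separate lower and upper clip bounds.

    ratio:     probability ratio pi_new(a|s) / pi_old(a|s)
    advantage: advantage estimate for the sampled action
    eps_low, eps_high: hypothetical clip widths (assumed values, not from the paper)
    """
    # Clip the ratio to the asymmetric interval [1 - eps_low, 1 + eps_high].
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the pessimistic (minimum) of the clipped and unclipped terms,
    # as in the standard PPO clipped objective.
    return min(ratio * advantage, clipped * advantage)


# Positive advantage: a ratio above 1 + eps_high stops earning extra objective.
high_ratio_value = asymmetric_clipped_surrogate(1.5, 1.0)
# Negative advantage: a ratio below 1 - eps_low is held at the lower bound.
low_ratio_value = asymmetric_clipped_surrogate(0.5, -1.0)
```

With different widths on each side, the policy is allowed to move further in one direction than the other before the gradient is cut off, which is the general property an asymmetric range provides.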
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
