MARPO: A Reflective Policy Optimization for Multi Agent Reinforcement Learning

Cuiling Wu; Yaozhong Gan; Junliang Xing; Ying Fu

arXiv:2512.22832·cs.MA·December 30, 2025

TL;DR

MARPO is a reflective policy optimization method for multi-agent reinforcement learning. It targets sample efficiency with a reflection mechanism that uses subsequent trajectories, and training stability with an asymmetric clipping range that is adjusted dynamically. The authors report that it outperforms other methods in classic multi-agent environments.

Contribution

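The paper presents MARPO, a multi-agent reinforcement learning algorithm that combines a reflection mechanism over subsequent trajectories with an asymmetric clipping mechanism derived from the KL divergence, with the aim of improving sample efficiency and training stability.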

Findings

01. MARPO outperforms existing methods in classic multi-agent environments.

02. The reflection mechanism improves sample efficiency.

03. Dynamic clipping enhances training stability.

Abstract

We propose Multi Agent Reflective Policy Optimization (MARPO) to alleviate the issue of sample inefficiency in multi agent reinforcement learning. MARPO consists of two key components: a reflection mechanism that leverages subsequent trajectories to enhance sample efficiency, and an asymmetric clipping mechanism that is derived from the KL divergence and dynamically adjusts the clipping range to improve training stability. We evaluate MARPO in classic multi agent environments, where it consistently outperforms other methods.
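The abstract does not give the clipping formula, so the following is only a minimal sketch of what an asymmetric clipped surrogate can look like in the PPO style. It is not the authors' objective. The function name and the parameters `eps_low` and `eps_high` are hypothetical, and the fixed bounds used here stand in for whatever KL-derived, dynamically adjusted range MARPO actually computes.

```python
def asymmetric_clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped surrogate for a single sample.

    Assumption (not from the paper): the ratio is clipped to
    [1 - eps_low, 1 + eps_high], with the two margins allowed to differ.
    ratio     : pi_new(a|s) / pi_old(a|s)
    advantage : advantage estimate for the same (s, a)
    """
    lower = 1.0 - eps_low
    upper = 1.0 + eps_high
    # Clamp the probability ratio into the asymmetric interval.
    clipped_ratio = min(max(ratio, lower), upper)
    # Pessimistic (minimum) choice between unclipped and clipped terms,
    # as in the standard PPO surrogate.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Positive advantage: gains from raising the ratio are capped at 1 + eps_high.
    print(asymmetric_clipped_surrogate(1.5, 1.0, eps_low=0.1, eps_high=0.3))
    # Negative advantage: lowering the ratio is only credited down to 1 - eps_low.
    print(asymmetric_clipped_surrogate(0.5, -1.0, eps_low=0.1, eps_high=0.3))
```

With `eps_low == eps_high` this reduces to the usual symmetric PPO clip; using two margins lets increases and decreases in action probability be bounded differently, which is the general idea behind an asymmetric range.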

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques