Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning
Chulabhaya Wijesundara, Andrea Baisero, Zhongheng Li, Gregory Castañón, Alan Carlin, Christopher Amato

TL;DR
This paper introduces MARS, a new policy optimization method for multi-agent reinforcement learning that overcomes limitations of existing ratio-based trust-region approaches, improving stability and performance.
Contribution
MARS replaces additive ratio clipping with a multiplicative symmetric barrier, enhancing gradient correction and reducing failure modes in multi-agent policy optimization.
Findings
MARS matches or exceeds MAPPO and MASPO performance across 47 tasks.
Ablation studies confirm the benefits derive from the symmetric barrier geometry.
MARS performs well on novel JAX benchmarks PaxMen and AeroJAX.
Abstract
Centralized training with decentralized execution (CTDE) is a standard framework for cooperative multi-agent policy-gradient reinforcement learning, allowing agents to learn from joint information while acting from local observations. Ratio-based trust-region methods such as Multi-Agent Proximal Policy Optimization (MAPPO) and Multi-Agent Simple Policy Optimization (MASPO) update decentralized actors using per-agent probability ratios weighted by joint advantage estimates. Teammate non-stationarity increases the variance of these advantages, which in turn increases the variance in the local ratio updates. This exposes two method-specific failure modes: MAPPO's additive clipping removes gradients for outlier samples and weakens recovery from policy drift, while MASPO's soft quadratic penalty can allow probability collapse. We introduce Multi-Agent Ratio Symmetry (MARS), a novel policy…
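The MAPPO failure mode named in the abstract can be illustrated with the standard PPO clipped surrogate applied to a single agent's probability ratio. The sketch below is not the paper's code and does not implement MARS, whose barrier is not specified in this excerpt. The function names and the default `eps=0.2` are assumptions chosen for illustration. It shows that once a ratio moves outside the clip range in the direction the advantage favors, the derivative of the surrogate with respect to the ratio is exactly zero, so that sample no longer contributes a gradient.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample.

    ratio:     pi_new(a|o) / pi_old(a|o) for one agent (hypothetical input)
    advantage: joint advantage estimate shared across agents
    """
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


def ppo_clip_grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Derivative of the clipped surrogate with respect to the ratio.

    The clipped branch is constant in the ratio, so the derivative is
    zero whenever that branch is the active minimum.
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0  # ratio already pushed too high: gradient removed
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0  # ratio already pushed too low: gradient removed
    return advantage  # unclipped branch is active


if __name__ == "__main__":
    # (ratio, advantage) pairs: in-range, outlier high, outlier low
    for r, a in [(1.1, 1.0), (1.5, 1.0), (0.5, -1.0), (0.5, 1.0)]:
        print(r, a, ppo_clip_surrogate(r, a), ppo_clip_grad_wrt_ratio(r, a))
```

In the outlier cases `(1.5, 1.0)` and `(0.5, -1.0)` the derivative is zero, which is the "removes gradients for outlier samples" behavior the abstract attributes to additive clipping. The case `(0.5, 1.0)` keeps its gradient because the unclipped branch is the smaller term there.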
