M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Yukai Feng; Zhiheng Wu; Zhengxing Wu; Junwen Gu; Junzhi Yu

arXiv:2604.19404·cs.RO·April 22, 2026


TL;DR

The paper introduces M$^{2}$GRPO, a multi-agent policy optimization framework for cooperative pursuit with biomimetic underwater robots. It targets long-horizon decision making, inter-robot coordination, and training stability, and it outperforms MAPPO and recurrent baselines in pursuit success rate.

Contribution

The paper proposes a Mamba-based group-relative policy optimization method that integrates attention mechanisms and reward normalization for scalable, stable multi-agent pursuit with underwater robots.
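As a rough illustration of the group-relative idea named above, the sketch below standardizes each rollout's return against the other rollouts in the same group, which is the normalization used in standard GRPO. How M$^{2}$GRPO forms groups across agents is not stated on this page, so the function name `group_relative_advantages`, the `eps` constant, and the flat list of returns are assumptions made for illustration only.

```python
import math
from typing import List


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each return against the mean and std of its own group.

    Hypothetical helper: `returns` holds one scalar return per rollout
    in a single group. This is a generic GRPO-style normalization, not
    the paper's exact formulation.
    """
    if not returns:
        raise ValueError("returns must contain at least one value")
    mean = sum(returns) / len(returns)
    # Population variance over the group (no Bessel correction).
    variance = sum((r - mean) ** 2 for r in returns) / len(returns)
    std = math.sqrt(variance)
    # eps keeps the division finite when every return in the group is equal.
    return [(r - mean) / (std + eps) for r in returns]
```

Because the baseline is the group mean rather than a learned value function, a rollout is rewarded only for doing better than its peers, and a group with identical returns yields zero advantage for every member.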

Findings

01. M$^{2}$GRPO outperforms MAPPO and recurrent baselines in pursuit success rate.

02. The method improves capture efficiency in simulated and real-world experiments.

03. It reduces training resource demands while maintaining stability and scalability.

Abstract

Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability,…
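The abstract does not spell out what "normalized Gaussian sampling" means, so the following is only a sketch of one common way to obtain bounded continuous actions from a Gaussian policy head: sample from the Gaussian, then squash with `tanh` and scale. The `tanh` squashing, the function name `sample_bounded_action`, and the parameters `log_std` and `action_limit` are assumptions for illustration, not the paper's stated method.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_bounded_action(
    mean: Sequence[float],
    log_std: Sequence[float],
    action_limit: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw one action per dimension and bound it to [-action_limit, action_limit].

    Hypothetical helper: `mean` and `log_std` stand in for the outputs of
    a policy network; the tanh squashing is an assumed bounding scheme.
    """
    if len(mean) != len(log_std):
        raise ValueError("mean and log_std must have the same length")
    if rng is None:
        rng = random.Random()
    action = []
    for mu, ls in zip(mean, log_std):
        # Unbounded Gaussian sample with standard deviation exp(log_std).
        z = rng.gauss(mu, math.exp(ls))
        # tanh maps the sample into [-1, 1]; scaling sets the actuator range.
        action.append(action_limit * math.tanh(z))
    return action
```

Passing a seeded `random.Random` makes the draw reproducible, which is convenient when comparing rollouts within a group.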
