Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Haoyang Hong; Jiajun Yin; Yuan Wang; Jingnan Liu; Zhe Chen; Ailing Yu; Ji Li; Zhiling Ye; Hansong Xiao; Yefei Chen; Hualei Zhou; Yun Yue; Minghui Yang; Chunxiao Guo; Junwei Liu; Peng Wei; Jinjie Gu

arXiv:2511.13288·cs.AI·November 19, 2025

Open Access

TL;DR

This paper introduces M-GRPO, a hierarchical training method for multi-agent systems with distinct LLMs, improving stability and efficiency in tool-augmented reasoning tasks through trajectory alignment and decoupled optimization.

Contribution

The paper proposes M-GRPO, a hierarchical policy optimization algorithm that enables scalable training of multi-agent systems with separate LLMs, addressing optimization challenges in distributed environments.

Findings

01

M-GRPO outperforms single-agent and frozen sub-agent baselines on real-world benchmarks.

02

A decoupled training pipeline enables scalable multi-agent training across servers.

03

Trajectory alignment improves stability and sample efficiency in multi-agent reasoning.

Abstract

Multi-agent systems perform well on general reasoning tasks, but a lack of training in specialized domains limits their accuracy. Current training methods train a single, unified large language model (LLM) for all agents in the system, which may limit performance because different agents face different underlying data distributions. Training multi-agent systems with distinct LLMs is therefore the natural next step. However, this approach introduces optimization challenges: agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, which disrupts end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool…
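The abstract names Group Relative Policy Optimization (GRPO) as the algorithm that M-GRPO extends. As a rough illustration of the group-relative idea only, the sketch below normalizes each rollout's reward against the mean and standard deviation of its own group of rollouts for the same query. This is the standard GRPO advantage, not the hierarchical M-GRPO variant, whose details are not given in this listing. The function name `group_relative_advantages` and the `eps` stabilizer are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per rollout, normalized within the group.

    `rewards` holds the scalar rewards of all rollouts sampled for the
    same query. `eps` (an assumed stabilizer) avoids division by zero
    when every rollout in the group received the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one query: two succeeded, two failed.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value function is needed as a baseline.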

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Topic Modeling