How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs
Andrew Estornell, Jean-Francois Ton, Muhammad Faaiz Taufiq, Hang Li

TL;DR
This paper presents a hierarchical multi-agent framework where a single leader LLM is trained to coordinate multiple peer agents, improving reasoning performance efficiently without auxiliary feedback mechanisms.
Contribution
Introduces Multi-agent guided Leader Policy Optimization (MLPO), a training method in which a leader LLM learns to coordinate peer agents, improving multi-agent reasoning without auxiliary value networks or explicit agent feedback.
Findings
Significant performance improvements on BBH, MATH, and MMLU benchmarks.
Efficient training of a single leader LLM for multi-agent coordination.
Enhanced single-agent performance when deploying the trained leader without the team.
Abstract
Large Language Models (LLMs) have achieved strong performance on a wide range of complex reasoning tasks, yet further gains are often possible by leveraging the complementary strengths of multiple models. While multi-agent frameworks can improve solution quality by leveraging multiple LLMs, existing methods are often computationally expensive, both at training and inference time. In this work, we introduce a hierarchical multi-agent framework that addresses these challenges by training only a single leader LLM to coordinate a team of untrained peer agents. To this end, we propose Multi-agent guided Leader Policy Optimization (MLPO), a novel approach which trains the leader to evaluate and synthesize agent responses without auxiliary value networks or explicit agent feedback. Leaders trained with MLPO exhibit improved performance not only when interacting with the agent team at…
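As a rough illustration of the hierarchy the abstract describes, the sketch below runs a leader over the responses of a fixed agent team for a few rounds. Everything named here (`run_team`, `majority_leader`, the stub agents, and the prompt format used to pass the leader's output back to the agents) is a hypothetical stand-in rather than the paper's implementation; in the paper's setting the leader would be the MLPO-trained LLM, not a majority vote.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical type aliases: an agent maps a prompt to a response, and the
# leader maps (question, agent responses) to a synthesized answer.
Agent = Callable[[str], str]
Leader = Callable[[str, List[str]], str]


def run_team(question: str, leader: Leader, agents: List[Agent], rounds: int) -> str:
    """Run `rounds` rounds of agents-then-leader and return the leader's last output."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    leader_output = ""
    for _ in range(rounds):
        # Assumption: after the first round, agents see the leader's previous output.
        prompt = question
        if leader_output:
            prompt = f"{question}\n\nLeader's previous answer: {leader_output}"
        responses = [agent(prompt) for agent in agents]
        leader_output = leader(question, responses)
    return leader_output


def majority_leader(question: str, responses: List[str]) -> str:
    """Toy leader: return the most common agent response (stand-in for an LLM)."""
    return Counter(responses).most_common(1)[0][0]


# Toy stand-ins for untrained peer agents; each ignores its prompt.
stub_agents: List[Agent] = [lambda p: "4", lambda p: "4", lambda p: "5"]

final_answer = run_team("What is 2 + 2?", majority_leader, stub_agents, rounds=2)
print(final_answer)
```

Only the leader would be trained in this arrangement; the agents are plain callables, which mirrors the idea of treating the untrained team as part of the leader's environment.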
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper introduces a novel and practical hierarchical multi-agent framework. Its main advantage is its computational efficiency, as it only requires training a single "leader" model while coordinating a team of fixed, untrained peer agents. This significantly reduces the training cost and complexity compared to approaches that require co-training multiple specialized models.
- The proposed Multi-agent guided Leader Policy Optimization (MLPO) is a novel contribution. It provides an effective
- The benchmarks (MMLU, BBH, MATH) do not seem to be related to multi-agent systems; they are knowledge-intensive tasks. I hope the authors can clarify why they chose these benchmarks.
- Figure 2 does not seem consistent with Figure 1: in Figure 1, the data generation pipeline is online, i.e., the leader's feedback returns to the multi-agent system for new rollouts. If the leader's feedback cannot return to the multi-agent system, the leader will be degraded to a su
1. Training only a single leader LLM to cooperate among several untrained companion agents significantly reduces training and maintenance costs, yet preserves the benefits of collaboration.
2. The paper introduces the MLPO training framework: construct SFT data to enhance the model’s self-correction ability, and directly optimize the leader under the GRPO framework, resulting in a relatively simple training process.
3. The trained leader model can operate in both “team collaboration” and “single-
1. The experiments are conducted with a collaboration setting of K = 3 agents, without verifying the effectiveness of the proposed method as the number of untrained agents increases. Since MLPO treats untrained agents as part of the environment, increasing their number would make the environment more complex.
2. Although training only a single leader model reduces training costs, the overall collaboration quality may be constrained by the leader’s capability. If the leader LLM’s task ability is
-- Clear, modular architecture. §3.1 (pp. 3–4) details a two‑level hierarchy and the T‑round interaction loop; Figure 1 (p. 3) visually clarifies the leader–agents workflow and the think/answer structure the leader emits.
-- Well‑specified objective. §3.2 (pp. 4–5) formalizes MLPO as a GRPO variant that conditions the leader on agent responses, with Dr.GRPO‑style stability tweaks; the training‑data pipeline (4K agent proposals per task; filtered “easy” tasks) is explicit (§3.2, p. 5).
-- Robu
My main concern is a potentially unfair comparison:
-- SFT transparency & potential advantage. Appendix A.1 describes how the SFT data are constructed (synthetic backtracking/self‑correction), but omits crucial statistics: dataset size, token counts, domain/source mix, and sampling rules (p. 17). This makes it hard to judge how much of the gain stems from SFT itself vs. MLPO, and it obscures fairness vs. baselines that may not receive equivalent SFT.
-- Uneven pretraining across baseli
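The reviews above describe MLPO as a GRPO variant with Dr.GRPO‑style stability tweaks and no auxiliary value network. As a hedged sketch of the group-relative advantage that GRPO-style methods use in place of a learned value baseline (not the paper's exact objective, which this page does not reproduce): each sampled leader output is scored against the mean reward of its group. The function name, the `eps` constant, the use of population standard deviation, and the binary rewards in the usage lines are all assumptions for illustration; `normalize_std=False` corresponds to the Dr. GRPO choice of dropping the standard-deviation scaling.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    rewards: List[float], normalize_std: bool = True, eps: float = 1e-6
) -> List[float]:
    """Center each reward on the group mean; optionally scale by the group std."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    baseline = mean(rewards)  # the group mean stands in for a value network
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered  # mean-centering only
    scale = pstdev(rewards) + eps  # eps avoids division by zero for uniform groups
    return [c / scale for c in centered]


# Hypothetical group of four leader rollouts with binary correctness rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
scaled = group_relative_advantages(group_rewards)
centered_only = group_relative_advantages(group_rewards, normalize_std=False)
print(scaled)
print(centered_only)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one plausible motivation for filtering out "easy" tasks as mentioned in the review.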
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis
