Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs
Wei Yang, Jiacheng Pang, Shixuan Li, Paul Bogdan, Stephen Tu, Jesse Thomason

TL;DR
Maestro is a multi-agent framework that separates exploration and synthesis into specialized roles and uses a new reinforcement learning method, Conditional Listwise Policy Optimization (CLPO), to improve collaborative reasoning in large language models, with a reported 6% average accuracy gain over prior state-of-the-art methods.
Contribution
The paper presents Maestro, a new multi-agent collaboration paradigm with role decoupling and a novel CLPO training method, advancing multi-agent LLM capabilities.
Findings
Achieves a 6% average accuracy improvement over state-of-the-art methods.
Demonstrates effectiveness on mathematical reasoning and problem-solving benchmarks.
Outperforms existing approaches with up to 10% accuracy gains.
Abstract
Multi-agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis to the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem that fails to distinguish between genuine reasoning and superficially plausible arguments. To resolve this core challenge, we propose the Multi-Agent Exploration-Synthesis framework Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse…
Peer Reviews
Decision · Submitted to ICLR 2026
1. The authors address an important problem in multi-agent LLM systems, that is, divergent exploration followed by convergent synthesis of the explored solution space.
2. The proposed MAESTRO framework seems sound and nicely decomposes exploration and convergence.
3. The proposed CLPO RL objective effectively decomposes decisions and reasons to enable more effective credit-assignment.
4. They provide interesting analyses in section 4.2 on the effect of collaboration mechanisms, parts of the CLPO
1. It is unclear how diverse the generated solutions in Phase 1 really are. Is there any way to enforce more diversity in this phase?
2. The proposed CLPO objective (despite good motivation) consists of 4 loss terms with individual weights. However, there is no analysis on the sensitivity of the weighting terms of the proposed components. This would be valuable for the reader to better understand the difficulty of tuning CLPO.
3. For RL tuning, the paper primarily focuses on the small-scale mode
- The motivation is clear and the problem is well framed. The divergent vs convergent tension in collaborative LLMs is well articulated, and MAESTRO maps directly onto that cognitive split. This makes the system design easy to reason about. The credit assignment issue has been a pain point in the field of LLM RL, and any attempt to solve this problem is an appreciated effort.
- The paper is well written and well articulated.
- Claimed improvements of about 1% over GRPO on math/coding/MMLU (~2% on AMC) are too small to justify the substantially more contrived objective and orchestration. I find it hard to see these increments as a significant improvement over GRPO.
- Compute parity is under-specified for all baselines. It’s unclear whether baselines (incl. GRPO) operate under a matched total sampling budget (agent count × K × rounds), identical decoding setups, and equal reference-KL constraints. Without strict
I overall find this paper fairly strong. Originality: both the multi-agent framework, sans RL, and the RL are to my knowledge novel contributions. Further, I think the way the method is constructed is principled in a way that has clear relation to and interpretable differences from prior work — it’s a natural but still creative addition. Quality: the experiments are extensive. The headline comparisons are good and the analysis experiments help answer a range of natural questions (how does it s
Part of this stems from a question: I’d like to better understand how you control for inference budget. The appendix states that you match for collaboration budget (rounds, agents, and generations). Can one match all three, for all baselines? If so, how? Related, I’m trying to contextualize the results of Fig 2 with Fig 1: if I’m understanding correctly, the difference between SC and Central-Gen+SC is that Central-Gen+SC allows the convergence stage access to all of the divergent stage generatio
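The budget-matching concern raised in the reviews (a total sampling budget of agent count × K × rounds) can be made concrete with a small calculation. The following is a minimal sketch, not the paper's evaluation protocol: the `CollabConfig` class, the `budget_matched` helper, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollabConfig:
    """Hypothetical description of one method's sampling setup."""
    name: str
    agents: int      # number of agents sampling in parallel
    samples_k: int   # generations drawn per agent per round (K)
    rounds: int      # collaboration rounds

    def total_generations(self) -> int:
        # Total sampling budget as written in the review:
        # agent count x K x rounds.
        return self.agents * self.samples_k * self.rounds


def budget_matched(a: CollabConfig, b: CollabConfig) -> bool:
    """True when both setups draw the same total number of generations."""
    return a.total_generations() == b.total_generations()


# Illustrative numbers only; none are taken from the paper.
multi_agent = CollabConfig("multi-agent", agents=3, samples_k=4, rounds=2)
single_agent = CollabConfig("single-agent", agents=1, samples_k=24, rounds=1)
under_budget = CollabConfig("single-agent-small", agents=1, samples_k=8, rounds=1)

print(multi_agent.total_generations())            # 3 * 4 * 2 = 24
print(budget_matched(multi_agent, single_agent))  # True: 24 vs 24
print(budget_matched(multi_agent, under_budget))  # False: 24 vs 8
```

In this sketch two setups can agree on the total while differing in every individual factor, which is one way to read the reviewer's question about whether rounds, agents, and generations can all be matched at once.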
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics
