Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation
Zhiwei Zhang, Xiaomin Li, Yudi Lin, Hui Liu, Ramraj Chandradevan, Linlin Wu, Minhua Lin, Fali Wang, Xianfeng Tang, Qi He, Suhang Wang

TL;DR
This paper addresses lazy agent behavior in multi-agent LLM reasoning systems, proposing methods to measure influence and encourage deliberation, thereby enhancing collaboration and reasoning performance.
Contribution
It introduces a causal influence measurement and a verifiable reward mechanism to mitigate lazy behavior and improve multi-agent reasoning collaboration.
Findings
Mitigates lazy agent behavior in multi-agent systems
Enhances reasoning accuracy through deliberation mechanisms
Improves collaboration and task performance in complex reasoning
Abstract
Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost…
Peer Reviews
Decision·ICLR 2026 Poster
* The problem is clearly identified and the theoretical analysis gives some insight into why this occurs (though it is an interesting choice to call this Dr. MAMR if the paper goes out of its way to claim that this is distinct from Dr. GRPO).
* The introduced causal influence mechanism is a nice way to measure the impact of the restart action in a computationally efficient manner (though the effectiveness of the proposed semantic similarity is not really analyzed).
* The results show pretty c
* My biggest concern is that the lazy agent behavior (and corresponding proposed solution) is only applicable to one recent multi-agent framework, ReMA. Given that this is a recent (and to this point not very widely used) framework with relatively poor performance (it under-performs single-agent GRPO), it's unclear whether this problem/solution will have much impact. For example, there's nothing to indicate that other multi-agent LLM frameworks will induce similar behavior. The fact that no othe
- Clear diagnosis of the lazy agent failure mode with a simple insight into how turn normalization biases learning toward short dialogues.
- Method that is principled and modular, with a debiased objective, a practical influence estimator that reduces single-trajectory and phrasing bias, and a verifier-aligned restart mechanism.
- Consistent gains over strong baselines across multiple model sizes and math benchmarks, with improved stability and better pass-at-K behavior.
- Useful ablations and trai
- Empirical scope restricted to math reasoning; claims about multi-agent benefits would be stronger with code and other domains requiring reasoning.
- Since the paper emphasizes the single- vs. multi-agent comparison, it would be good to include compute vs. performance for single agents and Dr. MAMR.
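The review above mentions pass-at-K behavior without saying how it was computed. As a minimal sketch, assuming the commonly used unbiased estimator (draw `n` samples per problem, count `c` correct, and estimate the chance that at least one of `k` drawn samples is correct), the quantity could be computed as follows. The function name `pass_at_k` and its arguments are illustrative and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for one problem
    c: how many of those samples are correct
    k: sampling budget being evaluated (1 <= k <= n)
    Assumption: this is a standard estimator, not necessarily the one
    used in the paper under review.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that all k drawn samples are incorrect is
    # C(n - c, k) / C(n, k); math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 4 samples, 1 correct, budget of 2 draws.
    print(pass_at_k(n=4, c=1, k=2))
```

Per-problem values would typically be averaged over the benchmark to give a single pass-at-K score.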
-- Motivation: The paper argues that lazy‑agent behavior collapses multi‑agent systems into single‑agent reasoning, squandering collaboration benefits, an issue previously observed in MARL and newly shown here in LLM multi‑agent reasoning. The motivation is explicit in the introduction and contribution summary.
Well structured:
-- Crisp theoretical diagnosis of the cause. Theorem 1 (Sec. 5.1, p. 5) formalizes how the 1/T normalization in multi‑turn GRPO biases the gradient toward shorter
-- Domain generality is narrow. All experiments are on math; claims about “complex reasoning tasks” would be stronger with code, planning, or QA tasks where verifiability is harder. This leads to a related concern:
-- Restart relies on a verifiable end‑state. The restart reward needs a checkable final answer; outside math (or without ground‑truth/validators), the mechanism may not be directly applicable. Moreover, ablations show restart helps but less than CI/normalization fixes, suggesting nar
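Two of the reviews above attribute the lazy-agent bias to normalizing a multi-turn objective by the number of turns T. The following is a minimal sketch of that intuition only, assuming every turn in a trajectory shares the same advantage and receives an equal 1/T share of the loss; the function `per_turn_weights` and the flag `normalize_by_turns` are hypothetical names, and switching the flag off is just one illustrative way to remove the dependence on T, not necessarily the paper's debiased objective.

```python
from typing import List


def per_turn_weights(num_turns: int, normalize_by_turns: bool = True) -> List[float]:
    """Weight each turn's loss term receives inside one trajectory.

    With 1/T normalization a turn in a T-turn dialogue gets weight 1/T,
    so the same advantage pushes harder on turns of short dialogues.
    Hypothetical helper for illustration; not the paper's code.
    """
    if num_turns <= 0:
        raise ValueError("num_turns must be positive")
    weight = 1.0 / num_turns if normalize_by_turns else 1.0
    return [weight] * num_turns


if __name__ == "__main__":
    short_dialogue = per_turn_weights(2)  # each turn weighted 1/2
    long_dialogue = per_turn_weights(6)   # each turn weighted 1/6
    # Ratio of per-turn weights between the short and long dialogue.
    print(short_dialogue[0] / long_dialogue[0])
```

Under these assumptions a turn in the two-turn dialogue carries three times the weight of a turn in the six-turn dialogue, which is the direction of bias the reviewers describe: trajectories where one agent contributes little, and the dialogue ends quickly, are favored.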
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
