When Do Multi-Agent Systems Outperform? Analysing the Learning Efficiency of Agentic Systems
Junwei Su, Chuan Wu

TL;DR
This paper provides a theoretical analysis of when Multi-Agent Reinforcement Learning (MARL) outperforms Single-Agent RL (SARL) in training large language models, focusing on sample efficiency and task decomposition.
Contribution
It introduces a formal PAC framework for comparing MARL and SARL, deriving sample complexity bounds and analyzing the effects of task decomposition and alignment.
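For readers unfamiliar with PAC-style guarantees, the following is a minimal sketch of what a sample complexity bound looks like. It uses the textbook bound for a finite hypothesis class in the realizable setting, $m \ge (\ln|H| + \ln(1/\delta))/\epsilon$, and is not the bound derived in the paper; the function name `pac_sample_bound` and its parameters are illustrative assumptions.

```python
import math


def pac_sample_bound(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """Samples sufficient for a finite hypothesis class (realizable case).

    Textbook bound, NOT the paper's result: m >= (ln|H| + ln(1/delta)) / epsilon.
    epsilon is the target error, delta the allowed failure probability.
    """
    if num_hypotheses < 1:
        raise ValueError("num_hypotheses must be at least 1")
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    bound = (math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    print(pac_sample_bound(num_hypotheses=1000, epsilon=0.1, delta=0.05))
```

With these illustrative inputs the bound works out to 100 samples. The bound grows only logarithmically in the size of the hypothesis class and in $1/\delta$, but linearly in $1/\epsilon$.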
Findings
MARL outperforms SARL with independent subtasks
Dependent subtasks reduce MARL's advantage
Task alignment impacts learning efficiency
Abstract
Reinforcement Learning (RL) has emerged as a crucial method for training or fine-tuning large language models (LLMs), enabling adaptive, task-specific optimizations through interactive feedback. Multi-Agent Reinforcement Learning (MARL), in particular, offers a promising avenue by decomposing complex tasks into specialized subtasks learned by distinct interacting agents, potentially enhancing the ability and efficiency of LLM systems. However, theoretical insights regarding when and why MARL outperforms Single-Agent RL (SARL) remain limited, creating uncertainty in selecting the appropriate RL framework. In this paper, we address this critical gap by rigorously analyzing the comparative sample efficiency of MARL and SARL within the context of LLM. Leveraging the Probably Approximately Correct (PAC) framework, we formally define SARL and MARL setups for LLMs, derive explicit sample…
Peer Reviews
Decision · Submitted to ICLR 2026
The paper addresses an important and timely problem in the context of LLM-based RL by clarifying the theoretical boundary between MARL and SARL applicability. It provides valuable insights into how task decomposition structure influences sample efficiency and extends the discussion to imperfect alignment cases, offering practical relevance beyond ideal theoretical settings.
1. The paper does not specify whether agents share the same input or have partial observations. Would this assumption affect the validity of the theoretical derivations?
2. Several symbols, such as $r_i$ in Eq.~(3.1), are introduced without clear definitions. Please provide precise definitions to ensure the rigor and clarity of the theoretical reasoning.
3. The paper does not clearly explain why the $K^2$ term in Theorem~4.2 represents a tight bound. The dependence among rewards $r_i$ is not for
- The paper addresses a valuable and timely problem: providing a theoretical understanding of when multi-agent systems outperform single-agent systems. Such an analysis is needed to ground current LLM-based multi-agent research in solid theory.
- The theoretical foundation is solid. The authors derive PAC-based sample complexity bounds for both MARL and SARL, offering a rigorous comparison of their learning efficiencies.
- The inclusion of a small empirical study adds some empirical support to
- While the theoretical analysis is sound, the setting is overly simplified. The “multi-agent” formulation effectively reduces to a fixed workflow where each agent corresponds to a submodel handling a static subtask. This abstraction misses the richer dynamics of real LLM-based multi-agent systems, where benefits and failures often stem from high-level task decomposition and coordination rather than low-level execution efficiency.
- The conclusions that multi-agent systems help when subtasks ar
MARL-based LLM agentic systems have recently been widely explored for complex task processing. While many empirical algorithms have emerged for MARL-based approaches, theoretical guidance is currently lacking. This paper presents a rigorous theoretical derivation and, in the LLM agentic setting, provides a systematic PAC comparison and testable thresholds for SARL vs. MARL. In particular, the introduced "task alignment" factor, $\alpha$, quantifies the sample cost of MARL strategies in
1. Experimental Verification: First, I think there is a gap between the experimental verification and the motivation and background, which are based on a complex LLM agentic system. Section 4.4 only uses a lightweight synthetic linear task. The "dependence" in the synthetic experiment (i.e., the current output depends on the mean of the previous output) is completely different from the complex semantics, logic, and state dependencies in an LLM agentic system. Therefore, while this experiment math
Taxonomy
Topics: Reinforcement Learning in Robotics · Language and cultural evolution · Topic Modeling
