AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs
Florian Grötschla, Luis Müller, Jan Tönshoff, Mikhail Galkin, Bryan Perozzi

TL;DR
AgentsNet introduces a scalable benchmark for evaluating multi-agent LLM systems' ability to self-organize, communicate, and solve problems collaboratively across various network topologies, revealing strengths and limitations of current models.
Contribution
The paper presents AgentsNet, a new scalable benchmark inspired by distributed systems, to assess multi-agent LLMs' collaborative reasoning and self-organization capabilities.
Findings
Frontier LLMs perform well on small networks but struggle as network size increases.
Existing benchmarks are limited to 2-5 agents, while AgentsNet scales to 100 agents.
Some models show strong coordination in small setups but face challenges in larger, more complex networks.
Abstract
Large-language models (LLMs) have demonstrated powerful problem-solving capabilities, in particular when organized in multi-agent systems. However, the advent of such systems also raises several questions on the ability of a complex network of agents to effectively self-organize and collaborate. While measuring performance on standard reasoning benchmarks indicates how well multi-agent systems can solve reasoning tasks, it is unclear whether these systems are able to leverage their topology effectively. Here, we propose AgentsNet, a new benchmark for multi-agent reasoning. By drawing inspiration from classical problems in distributed systems and graph theory, AgentsNet measures the ability of multi-agent systems to collaboratively form strategies for problem-solving, self-organization, and effective communication given a network topology. We evaluate a variety of baseline methods on…
Peer Reviews
Decision: Submitted to ICLR 2026
**Scalability to large agent networks**: Tests up to 100 agents (Figure 5), far exceeding existing benchmarks limited to 2-5 agents.
**Theoretically grounded tasks**: Leverages well-studied distributed computing problems with known complexity bounds (Table 1).
**Comprehensive model evaluation**: Tests 10+ frontier models across different cost-performance trade-offs (Figure 1).
**Unclear protocol differentiation**: The paper claims to develop a "new multi-agent protocol" but doesn't differentiate from existing protocols like A2A or ANP mentioned in the survey (arXiv:2504.16736). The LOCAL model adaptation (Section 4) appears standard without clear innovation.
**Missing baseline comparisons**: No experimental comparison with established multi-agent topology methods (MACNET, MAS-GPT, GPTSwarm, Dylan, Tree, Graph(mesh, DAG), Star topology) despite citing them. Only class
The benchmark is indeed valuable, and the experiments are helpful to understand the state-of-the-art performance of agentic networks. The problems make sense in the context of decentralised coordination and “distributed intelligence”. I appreciate that the authors reported the complexity of the algorithms in the distributed setting (though they could spend a sentence on stressing that log* is a very slow-growing logarithm function: log*100 is actually very, very close to log*5, so the network s
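The reviewer's remark about log* can be checked with a short worked example. The sketch below assumes the base-2 iterated logarithm (the number of times log2 must be applied before the value drops to 1 or below); the helper name `log_star` is hypothetical and does not come from the paper.

```python
import math


def log_star(n: float) -> int:
    """Base-2 iterated logarithm: count applications of log2 until the value is <= 1.

    Hypothetical helper for illustration only; the base-2 convention is an assumption.
    """
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


if __name__ == "__main__":
    # Compare the smallest and largest network sizes mentioned in the reviews.
    for size in (5, 100):
        print(f"log*({size}) = {log_star(size)}")
```

Under this convention, log*(5) = 3 and log*(100) = 4, so a bound that grows with log* changes by only one step between 5 and 100 agents.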
The networks are not heterogeneous and comprise, in each evaluation, one model. I reckon an evaluation where models are mixed would give interesting insights into “blocking” nodes and communication issues that arise when different models interact. An analysis that would make this paper stronger is what kind of asynchronous algorithm agentic networks implement, and see if that varies with different sizes and models. Since the models do not see anything but their neighbours, I expect the algorith
This work introduces five tasks capable of scaling multi-agent systems to up to 100 agents, surpassing most existing studies in this domain. The authors observe a clear performance degradation as the network size increases. In addition, the paper benchmarks 27 network topologies using 10 state-of-the-art LLMs, which further strengthens the validity of the findings and insights. Overall, the paper is well-structured and clearly written, with figures that are easy to interpret and follow.
1. Compared with prior work such as MacNet [1], this paper reaches a different conclusion: the performance of current LLMs degrades across the proposed five tasks. This discrepancy should be further discussed and explained in Section 5.3.
2. The analysis is relatively limited. For distributed computing problems, metrics such as average communication rounds, concurrency characteristics, and other protocol-level indicators should be reported to more thoroughly understand the bottlenecks and failu
