AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs

Florian Grötschla; Luis Müller; Jan Tönshoff; Mikhail Galkin; Bryan Perozzi

arXiv:2507.08616 · cs.MA · July 14, 2025

TL;DR

AgentsNet introduces a scalable benchmark for evaluating multi-agent LLM systems' ability to self-organize, communicate, and solve problems collaboratively across various network topologies, revealing strengths and limitations of current models.

Contribution

The paper presents AgentsNet, a new scalable benchmark inspired by distributed systems, to assess multi-agent LLMs' collaborative reasoning and self-organization capabilities.

Findings

01

Frontier LLMs perform well on small networks but struggle as network size increases.

02

Existing benchmarks are limited to 2-5 agents, while AgentsNet scales to 100 agents.

03

Some models show strong coordination in small setups but face challenges in larger, more complex networks.

Abstract

Large language models (LLMs) have demonstrated powerful problem-solving capabilities, in particular when organized in multi-agent systems. However, the advent of such systems also raises several questions on the ability of a complex network of agents to effectively self-organize and collaborate. While measuring performance on standard reasoning benchmarks indicates how well multi-agent systems can solve reasoning tasks, it is unclear whether these systems are able to leverage their topology effectively. Here, we propose AgentsNet, a new benchmark for multi-agent reasoning. By drawing inspiration from classical problems in distributed systems and graph theory, AgentsNet measures the ability of multi-agent systems to collaboratively form strategies for problem-solving, self-organization, and effective communication given a network topology. We evaluate a variety of baseline methods on…
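To make "communication given a network topology" concrete, the following is a minimal sketch of neighbor-only message passing in synchronous rounds. It is not the benchmark's protocol: the agents here are deterministic stand-ins rather than LLMs, and leader election by flooding the largest ID is used only as a familiar classical problem from distributed systems. The names `flood_max_id` and `adjacency` are hypothetical.

```python
from typing import Dict, List


def flood_max_id(adjacency: Dict[int, List[int]], rounds: int) -> Dict[int, int]:
    """Run synchronous rounds in which every node forwards the largest ID it knows.

    Each node can only talk to its direct neighbors, so information travels
    one hop per round. Returns the largest ID known to each node at the end.
    """
    # Initially every node knows only its own ID.
    best = {node: node for node in adjacency}
    for _ in range(rounds):
        # All messages in a round are computed from the previous round's
        # state, so every node sends "simultaneously".
        inbox = {
            node: [best[neighbor] for neighbor in adjacency[node]]
            for node in adjacency
        }
        best = {node: max([best[node]] + inbox[node]) for node in adjacency}
    return best


# A path graph 0 - 1 - 2 - 3 (diameter 3).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

after_one_round = flood_max_id(path, rounds=1)
after_three_rounds = flood_max_id(path, rounds=3)
```

On this path graph, node 0 only learns about node 3 after three rounds, which is the graph's diameter. That hop-by-hop limit is why the topology matters: the same task needs more rounds of coordination on a sparse or large network than on a small or dense one.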

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

**Scalability to large agent networks**: Tests up to 100 agents (Figure 5), far exceeding existing benchmarks limited to 2-5 agents.

**Theoretically grounded tasks**: Leverages well-studied distributed computing problems with known complexity bounds (Table 1).

**Comprehensive model evaluation**: Tests 10+ frontier models across different cost-performance trade-offs (Figure 1).

Weaknesses

**Unclear protocol differentiation**: The paper claims to develop a "new multi-agent protocol" but doesn't differentiate from existing protocols like A2A or ANP mentioned in the survey (arXiv:2504.16736). The LOCAL model adaptation (Section 4) appears standard without clear innovation.

**Missing baseline comparisons**: No experimental comparison with established multi-agent topology methods (MACNET, MAS-GPT, GPTSwarm, Dylan, Tree, Graph (mesh, DAG), Star topology) despite citing them. Only class

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The benchmark is indeed valuable, and the experiments are helpful to understand the state-of-the-art performance of agentic networks. The problems make sense in the context of decentralised coordination and "distributed intelligence". I appreciate that the authors reported the complexity of the algorithms in the distributed setting (though they could spend a sentence on stressing that log*, the iterated logarithm, is a very slow-growing function: log*100 is actually very, very close to log*5, so the network s
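The reviewer's remark about log* can be checked with a short calculation. The sketch below computes the iterated logarithm, that is, the number of times a logarithm must be applied before the value drops to 1 or below. Base 2 is an assumption (the review does not state a base), and `log_star` is a hypothetical helper name.

```python
import math


def log_star(n: float) -> int:
    """Iterated logarithm (base 2): how many times log2 is applied until n <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


small_network = log_star(5)    # 5 -> 2.32 -> 1.22 -> 0.28
large_network = log_star(100)  # 100 -> 6.64 -> 2.73 -> 1.45 -> 0.54
```

Under this assumption, `log_star(5)` is 3 and `log_star(100)` is 4, so a twentyfold increase in network size adds only one to a log* bound.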

Weaknesses

The networks are not heterogeneous: in each evaluation, they comprise a single model. I reckon an evaluation where models are mixed would give interesting insights into "blocking" nodes and communication issues that arise when different models interact. The paper would be stronger with an analysis of what kind of asynchronous algorithm agentic networks implement, and of whether that varies with different sizes and models. Since the models do not see anything but their neighbours, I expect the algorith

Reviewer 03 · Rating 4 · Confidence 4

Strengths

This work introduces five tasks capable of scaling multi-agent systems to up to 100 agents, surpassing most existing studies in this domain. The authors observe a clear performance degradation as the network size increases. In addition, the paper benchmarks 27 network topologies using 10 state-of-the-art LLMs, which further strengthens the validity of the findings and insights. Overall, the paper is well-structured and clearly written, with figures that are easy to interpret and follow.

Weaknesses

1. Compared with prior work such as MacNet [1], this paper reaches a different conclusion: the performance of current LLMs degrades across the proposed five tasks. This discrepancy should be further discussed and explained in Section 5.3.

2. The analysis is relatively limited. For distributed computing problems, metrics such as average communication rounds, concurrency characteristics, and other protocol-level indicators should be reported to more thoroughly understand the bottlenecks and failu

Code & Models

Datasets

disco-eth/AgentsNet
dataset · 229 dl
