Benchmarking LLMs' Swarm intelligence
Kai Ruan, Mowen Huang, Ji-Rong Wen, Hao Sun

TL;DR
This paper introduces SwarmBench, a new benchmark to evaluate the swarm intelligence of Large Language Models in decentralized multi-agent tasks with limited local perception and communication, revealing current limitations in their coordination abilities.
Contribution
We present SwarmBench, a comprehensive benchmark with five multi-agent system (MAS) tasks for assessing LLMs' decentralized coordination, along with evaluation metrics and open-source tools for reproducible research.
Findings
LLMs show task-dependent performance variations.
Current LLMs struggle with long-range planning in decentralized scenarios.
LLMs exhibit some rudimentary coordination.
Abstract
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict swarm-like constraints (limited local perception and communication) remains largely unexplored. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks (Pursuit, Synchronization, Foraging, Flocking, Transport) within a configurable 2D grid environment, forcing agents to rely solely on local sensory input (a local view) and local communication. We propose metrics for coordination…
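The local-perception and local-communication constraints described in the abstract can be illustrated with a minimal sketch. This is not SwarmBench's actual API: the function names, the square window of side 2*radius+1, the "#" padding symbol, and the use of Chebyshev distance for message range are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Grid = List[List[str]]


def local_view(grid: Grid, row: int, col: int, radius: int, pad: str = "#") -> Grid:
    """Return the square window centered on (row, col).

    Assumption: the window has side 2*radius+1, and cells that fall
    outside the grid are filled with `pad` (a hypothetical wall symbol).
    """
    rows, cols = len(grid), len(grid[0])
    view: Grid = []
    for r in range(row - radius, row + radius + 1):
        line: List[str] = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            line.append(grid[r][c] if inside else pad)
        view.append(line)
    return view


def neighbors_in_range(
    positions: Dict[str, Tuple[int, int]], sender: str, radius: int
) -> List[str]:
    """Return the agents that would receive a local broadcast from `sender`.

    Assumption: an agent hears the message if its Chebyshev distance to
    the sender is at most `radius`. The sender itself is excluded.
    """
    sr, sc = positions[sender]
    return sorted(
        name
        for name, (r, c) in positions.items()
        if name != sender and max(abs(r - sr), abs(c - sc)) <= radius
    )


# Hypothetical 4x4 world with three agents A, B, C on the diagonal.
grid = [list("...."), list(".A.."), list("..B."), list("...C")]
positions = {"A": (1, 1), "B": (2, 2), "C": (3, 3)}
```

With these definitions, local_view(grid, 1, 1, 1) gives agent A a 3x3 window that contains B but not C, and neighbors_in_range(positions, "A", 1) returns only B. Each agent therefore has to act on partial information, which is the setting the benchmark targets.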
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Originality: SwarmBench is a novel benchmark specifically targeting decentralized, emergent coordination, a largely unexplored dimension in LLM-based MAS.
2. Quality: The experimental framework is comprehensive, with well-defined tasks, robust baselines, and rich metrics.
3. Significance: Highlights the limitations of current LLMs in real-world decentralized coordination scenarios, guiding future research.
4. Clarity: Visualizations and benchmark design are intuitive, facilitating understa
1. Limited exploration of training regimes: The exclusive use of zero-shot evaluation neglects the potential of in-context learning or RL fine-tuning to improve coordination.
2. Shallow communication analysis: Although the paper emphasizes the role of local communication, it does not deeply investigate language content or its evolution over time.
3. Scalability questions: While the benchmark is extensible, the current use case is limited to abstract 2D environments and may not generalize to hi
The article proposes a useful benchmark, and the code is released as an open-source toolkit, making it easier to replicate the experiments and to use it to train LLMs and other agents on the task. The experiments are run on several models and, apart from Figure 3, the figures and the results in general are easy to read and understand. One strength that is also a weakness is how the results are presented: the Appendix contains most of the interesting results. For example, Appendix L5 should be in the
While the benchmark is indeed useful and shows some coordination failures of top-performing LLMs, the paper does not propose any mechanism to mitigate these issues, and some of the benchmark results are difficult to interpret. Some insights are interesting (Claude outperforms every other model at Synchronization, and all the models are bad at Transport), but the paper lacks an insightful analysis of the reasons behind these failures (beyond saying that models cannot do long-horiz
- Benchmarking LLMs' capability in coordination and self-organization at the swarm level is important.
- The paper is well-motivated.
- There are tons of results and analyses in the Appendix.
- The main paper doesn't convey much information; all details are deferred to the Appendix, including the critical parts: how the benchmark is designed, what the observation and action spaces for agents are, and what the metrics are. With an Appendix of 80+ pages, it's hard to evaluate the true value of the paper.
- The benchmark seems designed more for testing LLM agents than LLMs directly. Without memory or planning structures, the tasks seem prohibitively challenging.
- No valuable i
1. This paper is the first to systematically evaluate the collaborative capabilities of LLM-based agents in swarm settings. The proposed benchmark environments capture classic coordination scenarios with broad applicability across multiple domains.
2. The paper is well-structured and clearly written, making it easy to follow the methodology and experimental findings.
3. The ablation studies are thoughtfully designed and provide meaningful insights. In particular, the analyses of local perception
1. **The setting is not real enough for swarm tasks.** In particular, the benchmark enforces anonymous, purely local broadcast messages, which preclude agents from maintaining stable partner identities even when they remain within each other’s field of view. This design blocks **consistent, neighbor-specific conventions (e.g., leader–follower handoffs, partner lock-on, trust updating)**, thereby underestimating the achievable coordination of embodied multi-agent systems with **local but consiste
Taxonomy
Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Modular Robots and Swarm Intelligence
Methods: Mixing Adam and SGD
