TL;DR
SILO-BENCH is a benchmark that tests whether multi-agent LLM systems can coordinate on and synthesize distributed information. It reveals a fundamental gap between communication and reasoning that worsens with scale.
Contribution
Introduces SILO-BENCH, a role-agnostic benchmark of 30 algorithmic tasks across three communication complexity levels, which exposes coordination and reasoning limitations in multi-agent LLM systems.
Findings
Agents spontaneously form task-appropriate coordination topologies and exchange information actively.
Agents systematically fail to synthesize the distributed information into correct answers; the failure is localized to the reasoning-integration stage.
Coordination overhead compounds with scale, negating parallelization benefits.
Abstract
Large language models are increasingly deployed in multi-agent systems to overcome context limitations by distributing information across agents. Yet whether agents can reliably compute with distributed information, rather than merely exchange it, remains an open question. We introduce SILO-BENCH, a role-agnostic benchmark of 30 algorithmic tasks across three communication complexity levels, evaluating 54 configurations over 1,620 experiments. Our experiments expose a fundamental Communication-Reasoning Gap: agents spontaneously form task-appropriate coordination topologies and exchange information actively, yet systematically fail to synthesize distributed state into correct answers. The failure is localized to the reasoning-integration stage where agents often acquire sufficient information but cannot integrate it. This coordination overhead compounds with scale, eventually…
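The experiment count in the abstract is consistent with a full cross of tasks and configurations. The sketch below checks that arithmetic; it assumes each of the 54 configurations is run once on each of the 30 tasks, which the abstract does not state explicitly, and the function name is hypothetical.

```python
# Assumption (not stated in the abstract): every configuration is
# evaluated exactly once on every task.
NUM_TASKS = 30
NUM_CONFIGURATIONS = 54


def total_experiments(num_tasks: int, num_configurations: int) -> int:
    """Return the number of runs in a full tasks x configurations cross."""
    return num_tasks * num_configurations


print(total_experiments(NUM_TASKS, NUM_CONFIGURATIONS))  # prints 1620
```

Under that assumption the product matches the 1,620 experiments reported.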
