The Collaboration Gap
Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz, Ece Kamar

TL;DR
This paper introduces a benchmark for evaluating collaboration among AI agents, revealing significant performance drops in multi-agent settings and proposing strategies like relay inference to improve teamwork.
Contribution
It presents a scalable, ecologically valid benchmark for assessing agent collaboration and uncovers the 'collaboration gap' in current models, suggesting new training and interaction strategies.
Findings
Models that perform well solo often perform markedly worse when required to collaborate.
Starting with the stronger agent improves collaborative outcomes.
Relay inference can significantly close the collaboration gap.
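One way to make the "collaboration gap" in these findings concrete is as a difference in success rates between solo and paired runs. The sketch below assumes that definition; the excerpt does not give the paper's exact metric, and the function names and outcome lists are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def success_rate(outcomes):
    """Fraction of episodes solved; `outcomes` is a non-empty list of booleans."""
    return mean(1.0 if solved else 0.0 for solved in outcomes)


def collaboration_gap(solo_outcomes, paired_outcomes):
    """Assumed definition: solo success rate minus paired success rate.

    A positive value means the model does worse when it has to collaborate.
    """
    return success_rate(solo_outcomes) - success_rate(paired_outcomes)


# Made-up placeholder outcomes for illustration only (not data from the paper).
solo = [True, True, True, False]
paired = [True, False, False, False]

gap = collaboration_gap(solo, paired)
print(f"collaboration gap: {gap:.2f}")
```

The same function could compare orderings (for example, which agent speaks first) by passing the outcomes of each ordering as the two arguments.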
Abstract
The trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with different information, privileges, and tools. The success of these systems will critically depend on effective collaboration among these heterogeneous agents, even under partial observability. Despite intense interest, few empirical studies have evaluated such agent-agent collaboration at scale. We propose a collaborative maze-solving benchmark that (i) isolates collaborative capabilities, (ii) modulates problem complexity, (iii) enables scalable automated grading, and (iv) imposes no output-format constraints, preserving ecological plausibility. Using this framework, we evaluate 32 leading open- and closed-source models in solo, homogeneous, and heterogeneous pairings. Our results reveal a "collaboration gap": models that perform well solo…
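To illustrate the kind of setup the abstract describes, the following is a minimal sketch of a maze split into two partial views, plus a deterministic check of a proposed path. Everything here is an assumption for illustration: the checkerboard split, the maze layout, and the names `split_views` and `is_valid_path` are hypothetical, and the paper itself grades free-form outputs rather than pre-parsed coordinate lists.

```python
MAZE = [
    "S.#...",
    ".##.#.",
    "...#..",
    ".#...#",
    ".#.#..",
    "...#.G",
]  # hypothetical 6x6 layout: S = start, G = goal, # = wall, . = open


def split_views(maze):
    """Hide alternating cells ('?') so neither agent sees the full maze."""
    view_a, view_b = [], []
    for r, row in enumerate(maze):
        view_a.append("".join(ch if (r + c) % 2 == 0 else "?" for c, ch in enumerate(row)))
        view_b.append("".join(ch if (r + c) % 2 == 1 else "?" for c, ch in enumerate(row)))
    return view_a, view_b


def find(maze, symbol):
    """Return the (row, col) of the first cell containing `symbol`."""
    for r, row in enumerate(maze):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not found in maze")


def is_valid_path(maze, path):
    """Check that `path` runs from S to G in unit steps without touching walls."""
    if not path or path[0] != find(maze, "S") or path[-1] != find(maze, "G"):
        return False
    for r, c in path:
        in_bounds = 0 <= r < len(maze) and 0 <= c < len(maze[r])
        if not in_bounds or maze[r][c] == "#":
            return False
    # Consecutive cells must be orthogonal neighbours.
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))


view_a, view_b = split_views(MAZE)
good = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (3, 2),
        (3, 3), (3, 4), (4, 4), (4, 5), (5, 5)]
bad = [(0, 0), (0, 1), (0, 2)]  # steps into a wall and never reaches G
print(is_valid_path(MAZE, good), is_valid_path(MAZE, bad))
```

In this sketch the two agents would each receive one view and have to exchange information to agree on a path; the validator only covers the final, already-parsed answer.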
Peer Reviews
Decision · Submitted to ICLR 2026
# originality
Evaluating collaboration between AI agents has been an active area of research for decades. This appears to be a variant of the Overcooked test [1], but with lower complexity and partial information. Because the authors do not discuss how this work fits into the literature, it is hard to evaluate its originality.
# quality
The experiments appear to have been done well, and they test a large number of LLM variants. The use of an automated LLM as grader is concerning, but they discuss the issues
The 6x6 maze seems very small; A* can solve that trivially. The use of a grader AI adds an additional level of complexity to the experiment. The authors don't appear to engage with the SOTA in collaborative AI, instead focusing on LLMs only. Most of my other concerns are in the other sections. I think that if the paper significantly toned down its claims, it would be publishable.
- The analysis of homogeneous, same-family (different strengths), and cross-model heterogeneous collaboration provides cool and valuable insights into agent interaction dynamics.
- The overall scope of the evaluations conducted appears quite comprehensive, covering many models and various collaboration settings.
- The proposed collaborative maze-solving benchmark is novel, isolates collaborative capabilities, and imposes minimal output constraints, which is a strong methodological contribution
It is unclear whether LLM collaboration failures in this specific maze task translate to an inability to collaborate effectively in more naturalistic use cases, such as coding tasks. Human performance on these specific maze tasks is missing, which makes it difficult to fully substantiate claims about the LLM "collaboration gap." The reliance on the autograder is questionable, as it may introduce systemic biases or errors compared to enforcing a deterministic output format for all models. The
Multi-agent collaboration is indeed an increasingly important topic. This paper helps identify key issues that can arise in LLM cooperation tasks by investigating LLM performance on a maze task. The writing is easy to follow, and the results are explained clearly. The results do demonstrate the existence of a collaboration gap. I found some of the experimental results interesting (e.g., the relay inference part).
1. My main concern is that the contribution of this paper, although insightful, may not reach the acceptance threshold of this top-tier conference. The main contribution is limited to identifying the collaboration gap; the authors do not make progress beyond that. I believe a contribution on algorithm design to close the collaboration gap would be helpful and would make the paper stronger.
2. The experiments in this paper are limited to the maze task. It is enough to suggest "the collab
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Mobile Crowdsensing and Crowdsourcing · Language and cultural evolution
