COMMA: A Communicative Multimodal Multi-Agent Benchmark

Timothy Ossowski; Danyal Maqbool; Jixuan Chen; Zefan Cai; Tyler Bradshaw; Junjie Hu

arXiv:2410.07553·cs.AI·December 17, 2025

Open Access·3 Reviews

TL;DR

COMMA introduces a new benchmark to evaluate multimodal multi-agent systems' collaborative communication, revealing significant weaknesses in current models' ability to effectively communicate and collaborate in complex tasks.

Contribution

This paper presents COMMA, a novel puzzle benchmark for assessing multimodal multi-agent communication and collaboration, addressing a critical gap in existing evaluation frameworks.

Findings

01

State-of-the-art models show surprising weaknesses in communication and collaboration.

02

Many reasoning models struggle to outperform random baselines.

03

Current models have significant room for improvement in multi-agent communication.

Abstract

The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 5·Confidence 3

Strengths

The focus on human-AI natural language communication for solving multimodal tasks addresses a valuable research problem, and this paper establishes a promising starting point for this direction using puzzles. The authors provide strong qualitative analysis, offering insights into the weaknesses of various VLMs as agent backbones. These examples help reveal common challenges among VLMs and highlight differences between them, contributing useful knowledge for future research.

Weaknesses

In the AI-human experiments, only three data points are used for each puzzle, which is insufficient given the relatively small performance differences between models. Additionally, the tasks are not exciting enough, because solving puzzles is somewhat far from real-world scenarios, so this benchmark may serve primarily as an introductory step in studying this field.

Reviewer 02·Rating 5·Confidence 4

Strengths

[1] A benchmark for assessing the collaborative abilities of VLMs is very valuable. Moreover, the presented benchmark assesses VLMs, the most functional single models to date, as opposed to non-multimodal, text-only LLMs. [2] The analysis of the failure cases, along with Figures 3 and 4, is very insightful. [3] The choice of models to evaluate is good: QwenVL and InternVL are at the top of the OpenVLM Leaderboard.

Weaknesses

[1] According to Figure 5 (left), the random baseline is very strong, which suggests that the benchmark is not well designed. This is mostly attributable to the small number of choices and to wrong choices not being penalized enough. The claim that modern agents are no better than the random baseline is strong only if the random baseline is weak, which is not the case. In general, intuitively, with random actions during “bomb defusal” I would not expect more than a 5% or even 1% success rate for it to be a proper
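The reviewer's argument about baseline strength can be made concrete with a small calculation. The sketch below is a toy model and is not taken from the paper: it assumes a hypothetical puzzle that needs `steps` correct actions, a random agent that picks uniformly among `n_choices` options on every attempt, and a budget of `max_mistakes` wrong picks before the puzzle is lost. All three parameters and the function name are illustrative assumptions.

```python
from math import comb


def random_success_prob(n_choices: int, steps: int, max_mistakes: int) -> float:
    """Success probability of a uniformly random agent in a toy puzzle.

    Assumed model (hypothetical, not the benchmark's actual rules):
    each attempt is correct with probability 1 / n_choices, independently
    of earlier attempts; the puzzle is solved once `steps` correct picks
    have been made, and lost if more than `max_mistakes` wrong picks occur.
    """
    if n_choices < 1 or steps < 1 or max_mistakes < 0:
        raise ValueError("need n_choices >= 1, steps >= 1, max_mistakes >= 0")

    p = 1.0 / n_choices  # chance that a single random pick is correct
    q = 1.0 - p          # chance that a single random pick is wrong

    # Negative binomial: probability of exactly j wrong picks occurring
    # before the final (steps-th) correct pick, summed over the budget.
    return sum(
        comb(steps + j - 1, j) * p**steps * q**j
        for j in range(max_mistakes + 1)
    )


if __name__ == "__main__":
    # Illustrative settings only; none of these values come from the paper.
    for n, k, m in [(2, 3, 2), (4, 3, 0), (4, 5, 1), (8, 5, 0)]:
        prob = random_success_prob(n, k, m)
        print(f"choices={n} steps={k} mistakes={m} -> {prob:.4f}")
```

Under these assumptions, the success rate of random play rises quickly as the number of choices shrinks or the mistake budget grows, which is the mechanism the reviewer points to when calling the random baseline too strong.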

Reviewer 03·Rating 5·Confidence 4

Strengths

As a comprehensive benchmark, the authors have evidently dedicated substantial effort. Regarding task distribution, the authors constructed ten subtasks that nearly comprehensively cover various aspects potentially implicated in collaborative task completion. From the perspective of model testing, the authors extensively evaluated numerous existing multimodal models, both closed-source and open-source, thereby effectively highlighting the limitations of current multimodal models. Finally, the au

Weaknesses

The establishment of a benchmark is undeniably a demanding task. However, I wish to express a few concerns. Firstly, in this work, what is the distinction between an agent and a model? In other words, in my previous reviews, I have consistently employed the term "model" rather than "agent" as I believe that this paper is essentially evaluating the capabilities of models and not so-called agents. If you are evaluating the capabilities of agents, please provide a definition of an agent, especially

Taxonomy

Topics·Multi-Agent Systems and Negotiation