Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Haochen Sun; Shuwen Zhang; Lujie Niu; Lei Ren; Hao Xu; Hao Fu; Fangkun Zhao; Caixia Yuan; Xiaojie Wang

arXiv:2502.20073 · cs.CL · December 2, 2025

TL;DR

This paper introduces Collab-Overcooked, a comprehensive benchmark for evaluating large language models as collaborative agents in multi-agent interactive environments, emphasizing collaboration and process-oriented metrics.

Contribution

It presents a novel multi-agent benchmark with diverse tasks and evaluation metrics, enabling systematic assessment of LLMs' collaborative capabilities in complex scenarios.

Findings

01 LLMs show strong goal interpretation abilities.

02 LLMs show significant gaps in active collaboration and adaptation.

03 The benchmark and tools are publicly available.

Abstract

Large Language Model (LLM)-based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework that supports diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are…

Code & Models

Repositories

yusaemeow/collab-overcooked (Official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education