TeamBench: Evaluating Agent Coordination under Enforced Role Separation

Yubin Kim; Chanwoo Park; Taehan Kim; Eugene Park; Samuel Schmidgall; Salman Rahman; Chunjong Park; Cynthia Breazeal; Xin Liu; Hamid Palangi; Hae Won Park; Daniel McDuff

arXiv:2605.07073·cs.AI·May 11, 2026

TL;DR

TeamBench is a benchmark for evaluating agent coordination under operating-system-enforced role separation. It shows that a team pass rate alone can mask whether agents actually coordinated or whether one role did another role's work.

Contribution

The paper introduces a benchmark that enforces role separation through access controls rather than prompts, so that coordination between Planner, Executor, and Verifier roles can be assessed beyond pass rates.

Findings

01 Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates.

02 Verifiers often approve incorrect submissions, affecting score accuracy.

03 Human studies show different interaction patterns under enforced role separation.

Abstract

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers…
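The role separation described in the abstract can be illustrated with a small capability table. This is a minimal sketch, not TeamBench's implementation: the names `Capability`, `ROLE_CAPABILITIES`, `is_allowed`, and `count_violations` are hypothetical, the one-capability-per-role mapping is an assumption read off the order in which the abstract lists capabilities and roles, and the check runs in-process, whereas the benchmark enforces separation at the operating-system level.

```python
from enum import Flag, auto


class Capability(Flag):
    """The three capabilities the abstract says are split across roles."""
    READ_SPEC = auto()       # access to the full task specification
    EDIT_WORKSPACE = auto()  # permission to modify files in the workspace
    CERTIFY = auto()         # permission to certify the final answer


# Assumed mapping (hypothetical): one capability per role.
ROLE_CAPABILITIES = {
    "planner": Capability.READ_SPEC,
    "executor": Capability.EDIT_WORKSPACE,
    "verifier": Capability.CERTIFY,
}

ALL_CAPABILITIES = (
    Capability.READ_SPEC | Capability.EDIT_WORKSPACE | Capability.CERTIFY
)


def is_allowed(role: str, capability: Capability) -> bool:
    """Return True if the role holds the requested capability."""
    return capability in ROLE_CAPABILITIES[role]


def separation_holds(table: dict) -> bool:
    """True if no single role holds all three capabilities."""
    return all(caps != ALL_CAPABILITIES for caps in table.values())


def count_violations(attempts: list) -> int:
    """Count attempted actions that fall outside a role's capabilities."""
    return sum(1 for role, cap in attempts if not is_allowed(role, cap))


# Illustrative log: the verifier attempting a workspace edit is the kind of
# cross-role action the abstract reports as more frequent in prompt-only runs.
attempts = [
    ("planner", Capability.READ_SPEC),
    ("executor", Capability.EDIT_WORKSPACE),
    ("verifier", Capability.EDIT_WORKSPACE),
    ("verifier", Capability.CERTIFY),
]
violations = count_violations(attempts)
```

Counting denied attempts separately from pass or fail outcomes is what lets two configurations with the same pass rate be distinguished by how often a role tried to do another role's work.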

Code & Models

Datasets

ybkim95/teambench
dataset · 155 downloads