TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Yubin Kim, Chanwoo Park, Taehan Kim, Eugene Park, Samuel Schmidgall, Salman Rahman, Chunjong Park, Cynthia Breazeal, Xin Liu, Hamid Palangi, Hae Won Park, Daniel McDuff

TL;DR
TeamBench is a benchmark for evaluating agent coordination under enforced role separation. It shows that a team pass rate alone can hide whether agents actually coordinated or whether one role effectively did another role's work.
Contribution
It introduces a benchmark in which role separation is enforced by the operating system rather than specified only in prompts, so that coordination can be assessed beyond pass rates.
Findings
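Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code.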
Verifiers often approve incorrect submissions, affecting score accuracy.
Human studies show different interaction patterns under enforced role separation.
Abstract
Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers…
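The role separation described in the abstract can be illustrated with a small capability model. This is a minimal sketch, not TeamBench's implementation: the names `Capability`, `ROLE_CAPABILITIES`, `is_allowed`, `separation_holds`, and `count_violations` are hypothetical, and the one-capability-per-role assignment is an assumption read off the order in which the abstract lists the roles. The real benchmark enforces separation at the operating-system level, which this in-process check does not reproduce.

```python
from enum import Flag, auto


class Capability(Flag):
    """The three privileges the abstract says are split across roles."""
    READ_SPEC = auto()       # read the full task requirements
    EDIT_WORKSPACE = auto()  # modify files in the workspace
    CERTIFY = auto()         # certify the final answer


# Assumed assignment (hypothetical): one capability per role.
ROLE_CAPABILITIES = {
    "planner": Capability.READ_SPEC,
    "executor": Capability.EDIT_WORKSPACE,
    "verifier": Capability.CERTIFY,
}

ALL_CAPABILITIES = (
    Capability.READ_SPEC | Capability.EDIT_WORKSPACE | Capability.CERTIFY
)


def is_allowed(role: str, action: Capability) -> bool:
    """Return True if the role holds the capability the action needs."""
    return action in ROLE_CAPABILITIES[role]


def separation_holds(assignment: dict) -> bool:
    """Check that no single role holds all three capabilities."""
    return all(caps != ALL_CAPABILITIES for caps in assignment.values())


def count_violations(log: list) -> int:
    """Count attempted actions that fall outside the acting role's capabilities."""
    return sum(1 for role, action in log if not is_allowed(role, action))


if __name__ == "__main__":
    # A verifier trying to edit the executor's code is a boundary violation,
    # the kind of event the abstract reports as more frequent in prompt-only runs.
    attempts = [
        ("executor", Capability.EDIT_WORKSPACE),
        ("verifier", Capability.EDIT_WORKSPACE),
        ("verifier", Capability.CERTIFY),
    ]
    print(separation_holds(ROLE_CAPABILITIES))
    print(count_violations(attempts))
```

In a prompt-only setup nothing stops the second attempt from succeeding, so it can only be counted after the fact; under sandbox enforcement the same attempt would be denied. Counting such attempts separately from the pass rate is what lets the two setups be distinguished even when their pass rates match.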
