Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

Tao Yu; Hao Wang; Changyu Li; Shenghua Chai; Minghui Zhang; Zhongtian Luo; Yuxuan Zhou; Haopeng Jin; Zhaolu Kang; Jiabing Yang; YiFan Zhang; Xinming Wang; Hongzhu Yi; Zheqi He; Jing-Shu Zheng; Xi Yang; Yan Huang; Liang Wang

arXiv:2605.08761·cs.MA·May 12, 2026

TL;DR

This paper introduces EntCollabBench, a benchmark for evaluating multi-agent collaboration in enterprise settings, and highlights the difficulties current LLM agents face in realistic organizational tasks.

Contribution

It presents a new benchmark that simulates role-specific, permission-controlled enterprise environments to evaluate multi-agent collaboration capabilities.

Findings

01. Current LLM agents struggle with enterprise collaboration tasks.

02. Agents have difficulty with delegation, context transfer, and decision-making.

03. The benchmark provides a reproducible environment for future improvements.

Abstract

Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing enterprise benchmarks largely evaluate single agents with broad tool access, while existing multi-agent benchmarks rarely capture realistic enterprise constraints such as role specialization, access control, stateful business systems, and policy-based approvals. We introduce EntCollabBench, a benchmark for evaluating enterprise multi-agent collaboration. EntCollabBench simulates a permission-isolated organization with 11 role-specialized agents across six departments and contains two evaluation subsets: a Workflow subset, where agents collaboratively modify enterprise system states, and an Approval subset, where agents make…
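To make the idea of a permission-isolated, role-specialized setup concrete, the following is a minimal illustrative sketch and not the benchmark's actual implementation. All names here (`RoleAgent`, `PermissionDenied`, the two tools, and the Procurement and Finance roles) are hypothetical and chosen only for illustration. It shows one agent changing a stateful system and then having to delegate a step that its own role cannot perform.

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Raised when an agent calls a tool outside its role's permissions."""


def create_purchase_order(state, item, amount):
    # Hypothetical tool: adds a pending order to the shared system state.
    order_id = len(state["orders"]) + 1
    state["orders"][order_id] = {"item": item, "amount": amount, "status": "pending"}
    return order_id


def approve_purchase_order(state, order_id):
    # Hypothetical tool: marks an existing order as approved.
    state["orders"][order_id]["status"] = "approved"
    return state["orders"][order_id]["status"]


TOOLS = {
    "create_purchase_order": create_purchase_order,
    "approve_purchase_order": approve_purchase_order,
}


@dataclass
class RoleAgent:
    name: str
    department: str
    allowed_tools: frozenset

    def call_tool(self, tool, state, **kwargs):
        # Access control: each role sees only its own slice of the tool set.
        if tool not in self.allowed_tools:
            raise PermissionDenied(f"{self.name} may not call {tool}")
        return TOOLS[tool](state, **kwargs)


def run_workflow(state, requester, approver, item, amount):
    """The requester creates the order, then delegates the step it cannot do."""
    order_id = requester.call_tool(
        "create_purchase_order", state, item=item, amount=amount
    )
    try:
        requester.call_tool("approve_purchase_order", state, order_id=order_id)
    except PermissionDenied:
        # Delegation: hand the order over to a role that holds the permission.
        approver.call_tool("approve_purchase_order", state, order_id=order_id)
    return order_id


state = {"orders": {}}
procurement = RoleAgent(
    "procurement_agent", "Procurement", frozenset({"create_purchase_order"})
)
finance = RoleAgent("finance_agent", "Finance", frozenset({"approve_purchase_order"}))
order_id = run_workflow(state, procurement, finance, item="laptop", amount=1200)
```

In this toy setup, success depends on the final system state (the order ending up approved) rather than on any single agent's output, which is one plausible way to read the state-modifying framing of the Workflow subset described above.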

Code & Models

Datasets

Kirito-Lab/EntCollabBench
dataset, 70 downloads