Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Zirui Tang; Xuanhe Zhou; Yumou Liu; Linchun Li; Yukai Wu; Weizheng Wang; Hongzhang Huang; Wei Zhou; Jun Zhou; Jiachen Song; Shaoli Yu; Jinqi Wang; Zihang Zhou; Hongyi Zhou; Yuting Lv; Jinyang Li; Jiashuo Liu; Ruoyu Chen; Chunwei Liu; GuoLiang Li; Jihua Kang; Fan Wu

arXiv:2605.03596·cs.AI·May 15, 2026

TL;DR

Workspace-Bench 1.0 introduces a large-scale, realistic benchmark for evaluating AI agents on workspace tasks involving complex file dependencies, highlighting current limitations in agent performance.

Contribution

The paper presents a new comprehensive benchmark with realistic workspaces, extensive tasks, and evaluation of multiple models, addressing the gap in workspace-level AI evaluation.

Findings

01. Current agents achieve only about 60% accuracy, below the human performance of 80.7%.

02. Average agent performance across models is only 43.3%.

03. Workspace-Bench-Lite reduces evaluation costs by 70% while preserving the distribution of the full benchmark.

Abstract

Workspace Learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, and 20,476 files (up to 20 GB), and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide…
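
To make the abstract's vocabulary concrete, here is a minimal sketch of how a task with a file dependency graph and a set of rubrics might be represented. It is not the benchmark's actual schema or scoring code: the `Task` class, its field names, the file paths, and the pass-rate scoring are all hypothetical, chosen only for illustration. For scale, 7,399 rubrics over 388 tasks works out to about 19 rubrics per task on average.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record (not the benchmark's real schema)."""
    task_id: str
    # Adjacency list: file -> files it depends on (illustrative paths).
    deps: dict[str, list[str]] = field(default_factory=dict)
    # One boolean per rubric: True if the agent's output satisfied it.
    rubric_results: list[bool] = field(default_factory=list)


def required_files(task: Task, targets: list[str]) -> set[str]:
    """Return every file reachable from the targets by following dependencies."""
    seen: set[str] = set()
    stack = list(targets)
    while stack:
        path = stack.pop()
        if path in seen:
            continue  # already visited; also guards against cycles
        seen.add(path)
        stack.extend(task.deps.get(path, []))
    return seen


def rubric_pass_rate(task: Task) -> float:
    """Fraction of rubrics satisfied (assumed scoring); 0.0 if there are none."""
    if not task.rubric_results:
        return 0.0
    return sum(task.rubric_results) / len(task.rubric_results)


task = Task(
    task_id="demo",
    deps={
        "report.docx": ["sales.xlsx", "notes.md"],
        "sales.xlsx": ["raw/orders.csv"],
    },
    rubric_results=[True, True, False, True],
)
files = required_files(task, ["report.docx"])
rate = rubric_pass_rate(task)
```

In this toy workspace, producing `report.docx` transitively requires four files, which is the kind of cross-file retrieval the abstract describes. The unweighted pass rate is only one plausible way to aggregate rubric outcomes.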
