ProSoftArena: Benchmarking Hierarchical Capabilities of Multimodal Agents in Professional Software Environments
Jiaxin Ai, Yukang Feng, Fanrui Zhang, Jianwen Sun, Zizhen Li, Chuanhao Li, Yifan Chang, Wenxiao Wu, Ruoxi Wang, Mingliang Zhai, Kaipeng Zhang

TL;DR
ProSoftArena introduces a comprehensive benchmark for evaluating multimodal agents in professional software environments, revealing current limitations and guiding future improvements in agent capabilities for real-world tasks.
Contribution
The paper presents the first hierarchical benchmark and platform tailored to professional software workflows, comprising a large set of realistic tasks and an execution-based evaluation framework with human-in-the-loop assessment.
Findings
Best agent achieves only 24.4% success on L2 tasks
Agents fail completely on L3 multi-software workflows
Provides insights for improving agent design in professional settings
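As a rough illustration of how per-level success rates like those above might be tallied, here is a minimal sketch. It assumes each task outcome is recorded as a (level, passed) pair. The function name, the record format, the "L1" label, and the demo data are all hypothetical and are not taken from the paper or its platform.

```python
from collections import defaultdict


def success_rate_by_level(results):
    """Return {level: success rate in percent} for (level, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for level, passed in results:
        totals[level] += 1
        passes[level] += int(bool(passed))
    return {level: 100.0 * passes[level] / totals[level] for level in totals}


# Hypothetical outcomes for illustration only; not data from the paper.
demo = [
    ("L1", True), ("L1", False),
    ("L2", True), ("L2", False), ("L2", False), ("L2", False),
    ("L3", False),
]
rates = success_rate_by_level(demo)
print(rates)
```

Grouping by level before averaging keeps a low score on harder tiers from being masked by easier ones, which is the point of reporting results per level of the hierarchy.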
Abstract
Multimodal agents are making rapid progress on general computer-use tasks, yet existing benchmarks remain largely confined to browsers and basic desktop applications, falling short in professional software workflows that dominate real-world scientific and industrial practice. To close this gap, we introduce ProSoftArena, a benchmark and platform specifically for evaluating multimodal agents in professional software environments. We establish the first capability hierarchy tailored to agent use of professional software and construct a benchmark of 436 realistic work and research tasks spanning 6 disciplines and 13 core professional applications. To ensure reliable and reproducible assessment, we build an executable real-computer environment with an execution-based evaluation framework and uniquely incorporate a human-in-the-loop evaluation paradigm. Extensive experiments show that even…
Taxonomy
Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques
