ProSoftArena: Benchmarking Hierarchical Capabilities of Multimodal Agents in Professional Software Environments

Jiaxin Ai; Yukang Feng; Fanrui Zhang; Jianwen Sun; Zizhen Li; Chuanhao Li; Yifan Chang; Wenxiao Wu; Ruoxi Wang; Mingliang Zhai; Kaipeng Zhang

arXiv:2601.02399·cs.SE·January 7, 2026

TL;DR

ProSoftArena is a benchmark and platform for evaluating multimodal agents in professional software environments. Its results expose how limited current agents are: the best one succeeds on only 24.4% of L2 tasks, and none completes the L3 multi-software workflows.

Contribution

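The paper presents the first capability hierarchy and benchmark platform tailored to professional software workflows: 436 realistic work and research tasks spanning 6 disciplines and 13 core professional applications, run in an executable real-computer environment with execution-based evaluation and human-in-the-loop assessment.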

Findings

01. Best agent achieves only 24.4% success on L2 tasks

02. Agents fail completely on L3 multi-software workflows

03. Provides insights for improving agent design in professional settings

Abstract

Multimodal agents are making rapid progress on general computer-use tasks, yet existing benchmarks remain largely confined to browsers and basic desktop applications, falling short in professional software workflows that dominate real-world scientific and industrial practice. To close this gap, we introduce ProSoftArena, a benchmark and platform specifically for evaluating multimodal agents in professional software environments. We establish the first capability hierarchy tailored to agent use of professional software and construct a benchmark of 436 realistic work and research tasks spanning 6 disciplines and 13 core professional applications. To ensure reliable and reproducible assessment, we build an executable real-computer environment with an execution-based evaluation framework and uniquely incorporate a human-in-the-loop evaluation paradigm. Extensive experiments show that even…

Taxonomy

Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques