EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents
Ying Mo, Yu Bai, Dapeng Sun, Yuqian Shi, Yukai Miao, Li Chen, Dan Li

TL;DR
EntWorld introduces a comprehensive benchmark for evaluating AI agents in complex enterprise environments, emphasizing realistic workflows, strict logic, and state verification, revealing significant gaps in current model capabilities.
Contribution
The paper presents EntWorld, a large-scale, schema-grounded enterprise benchmark with a novel SQL-based verification system, addressing limitations of previous datasets and fostering enterprise-specific AI development.
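To illustrate what SQL-based verification can look like in general, here is a minimal sketch that checks a task's outcome by querying the backend database state instead of inspecting the agent's screen output. It is not the paper's implementation: the `tickets` table, the `verify_task` and `demo` functions, and the use of SQLite are all assumptions made for this example.

```python
import sqlite3
from collections import Counter


def verify_task(conn, check_sql, expected_rows):
    """Hypothetical checker: compare query results on the current DB state
    against the expected rows, ignoring row order."""
    actual = conn.execute(check_sql).fetchall()
    return Counter(actual) == Counter(expected_rows)


def demo():
    # Assumed toy schema standing in for an enterprise ticketing backend.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT, assignee TEXT)"
    )
    conn.execute("INSERT INTO tickets VALUES (1, 'open', NULL)")

    # Task (illustrative): "Resolve ticket 1 and assign it to alice."
    check_sql = "SELECT status, assignee FROM tickets WHERE id = 1"
    expected = [("resolved", "alice")]

    # Before the agent acts, the check should fail.
    before = verify_task(conn, check_sql, expected)

    # Simulate the database effect of the agent's GUI actions.
    conn.execute(
        "UPDATE tickets SET status = 'resolved', assignee = 'alice' WHERE id = 1"
    )

    # After the action, the check should pass.
    after = verify_task(conn, check_sql, expected)
    conn.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

Checking the database directly means a task counts as solved only if the underlying state actually changed as required, regardless of what the agent reports.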
Findings
State-of-the-art models achieve only a 47.61% success rate on EntWorld.
EntWorld reveals a substantial gap between current AI agents and human performance in enterprise tasks.
The benchmark enables rigorous evaluation of domain-specific agent capabilities.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled agents to operate in open-ended web and operating system environments. However, existing benchmarks predominantly target consumer-oriented scenarios (e.g., e-commerce and travel booking), failing to capture the complexity and rigor of professional enterprise workflows. Enterprise systems pose distinct challenges, including high-density user interfaces, strict business logic constraints, and a strong reliance on precise, state-consistent information retrieval, settings in which current generalist agents often struggle. To address this gap, we introduce EntWorld, a large-scale benchmark consisting of 1,756 tasks across six representative enterprise domains, including customer relationship management (CRM), information technology infrastructure library (ITIL), and enterprise resource planning (ERP) systems. Unlike…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
