EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents

Ying Mo; Yu Bai; Dapeng Sun; Yuqian Shi; Yukai Miao; Li Chen; Dan Li

arXiv:2601.17722 · cs.AI · January 27, 2026

Open Access

TL;DR

EntWorld introduces a comprehensive benchmark for evaluating AI agents in complex enterprise environments, emphasizing realistic workflows, strict logic, and state verification, revealing significant gaps in current model capabilities.

Contribution

The paper presents EntWorld, a large-scale, schema-grounded enterprise benchmark with a novel SQL-based verification system, addressing limitations of previous datasets and fostering enterprise-specific AI development.
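This summary does not describe how the paper's verifier is implemented. As a rough illustration of the general idea of SQL-based state verification, the minimal sketch below checks an agent's outcome by querying the application's backend database, not by inspecting the screen. The `tickets` table, its columns, and the `verify_state` helper are hypothetical names chosen for this example and do not come from EntWorld.

```python
import sqlite3


def verify_state(conn, query, expected_rows):
    """Return True if the query over the backend database yields exactly the expected rows."""
    return conn.execute(query).fetchall() == expected_rows


# Hypothetical in-memory table standing in for an enterprise backend (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT, assignee TEXT)")
conn.execute("INSERT INTO tickets VALUES (1, 'open', 'alice')")

# Stand-in for the side effect of an agent's GUI actions, e.g. closing ticket 1.
conn.execute("UPDATE tickets SET status = 'closed' WHERE id = 1")
conn.commit()

# The task counts as solved only if the database ends up in the expected state.
task_passed = verify_state(
    conn,
    "SELECT status, assignee FROM tickets WHERE id = 1",
    [("closed", "alice")],
)
```

Checking the final database state in this way makes the pass/fail decision independent of how the agent navigated the interface, which is one plausible reason a benchmark might prefer it over screenshot or text matching.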

Findings

01

State-of-the-art models achieve a success rate of only 47.61% on EntWorld.

02

EntWorld reveals a substantial gap between current AI agents and human performance in enterprise tasks.

03

The benchmark enables rigorous evaluation of domain-specific agent capabilities.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled agents to operate in open-ended web and operating system environments. However, existing benchmarks predominantly target consumer-oriented scenarios (e.g., e-commerce and travel booking), failing to capture the complexity and rigor of professional enterprise workflows. Enterprise systems pose distinct challenges, including high-density user interfaces, strict business logic constraints, and a strong reliance on precise, state-consistent information retrieval. These are settings in which current generalist agents often struggle. To address this gap, we introduce EntWorld, a large-scale benchmark consisting of 1,756 tasks across six representative enterprise domains, including customer relationship management (CRM), information technology infrastructure library (ITIL), and enterprise resource planning (ERP) systems. Unlike…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques