Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments
Harsh Vishwakarma, Ankush Agarwal, Ojas Patil, Chaitanya Devaguptapu, Mahesh Chandran

TL;DR
This paper introduces EnterpriseBench, a comprehensive benchmark for evaluating LLM agents in enterprise environments, highlighting current limitations and opportunities for AI system improvements in complex organizational settings.
Contribution
The paper presents EnterpriseBench, a novel benchmark simulating enterprise complexities, and a data generation pipeline for creating realistic enterprise tasks for LLM evaluation.
Findings
State-of-the-art LLMs achieve only 41.8% task completion in enterprise scenarios.
Enterprise environments pose unique challenges for LLMs due to data fragmentation and access controls.
Significant room for improvement exists in developing enterprise-focused AI systems.
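To make the second finding concrete, here is a minimal sketch of why fragmented data combined with access controls is hard for an agent. It is not taken from EnterpriseBench: the role names, the `Record` type, the sample records, and the `gather` helper are all hypothetical, chosen only to illustrate the idea that one question may require several sources, some of which the acting user cannot read.

```python
from dataclasses import dataclass

# Hypothetical role hierarchy (an assumption for illustration):
# a higher number means broader read access.
ROLE_LEVEL = {"employee": 1, "manager": 2, "hr_admin": 3}


@dataclass(frozen=True)
class Record:
    source: str       # system holding the record, e.g. "hr" or "tickets"
    employee_id: str
    payload: dict
    min_role: str     # lowest role allowed to read this record


# Facts about a single employee are split across separate sources.
RECORDS = [
    Record("hr", "E1", {"salary_band": "B3"}, "hr_admin"),
    Record("tickets", "E1", {"open_tickets": 4}, "employee"),
    Record("directory", "E1", {"team": "Payments"}, "employee"),
]


def gather(employee_id, role, records=RECORDS):
    """Merge every record about employee_id that the given role may read.

    Returns the merged fields plus the list of sources that were denied,
    so a caller can tell "no data" apart from "not permitted".
    """
    level = ROLE_LEVEL[role]
    merged, denied = {}, []
    for rec in records:
        if rec.employee_id != employee_id:
            continue
        if ROLE_LEVEL[rec.min_role] <= level:
            merged.update(rec.payload)
        else:
            denied.append(rec.source)
    return merged, denied


if __name__ == "__main__":
    print(gather("E1", "employee"))
    print(gather("E1", "hr_admin"))
```

In this toy setup the same query returns a different view depending on who asks, so an agent acting for a regular employee has to complete the task, or report that it cannot, without the HR record.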
Abstract
Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM-based systems into enterprise systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls. We present EnterpriseBench, a comprehensive benchmark that simulates enterprise settings, featuring 500 diverse tasks across software engineering, HR, finance, and administrative domains. Our benchmark uniquely captures key enterprise characteristics including data source fragmentation, access control hierarchies, and cross-functional workflows. Additionally, we…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Artificial Intelligence in Law · Data Quality and Management
