EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Shiva Krishna Reddy Malay; Shravan Nayak; Jishnu Sethumadhavan Nair; Sagar Davasam; Aman Tiwari; Sathwik Tejaswi Madhusudhan; Sridhar Krishna Nemala; Srinivas Sunkara; Sai Rajeswar

arXiv:2603.13594·cs.AI·March 17, 2026

Open Access · 4 Datasets

TL;DR

EnterpriseOps-Gym is a benchmark environment for evaluating how well large language models carry out complex, long-horizon planning tasks in realistic enterprise settings. Its results expose current limitations and point to where improvement is needed.

Contribution

The paper introduces EnterpriseOps-Gym, a comprehensive benchmark that pairs realistic tasks with a realistic environment for assessing agentic planning in enterprise contexts, and uses it to highlight key challenges and performance gaps.

Findings

01. Top model success rate is only 37.4%.

02. Oracle human plans improve performance by 14-35%.

03. Models often fail to refuse infeasible tasks, risking harmful outcomes.

Abstract

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in…

Taxonomy

Topics: AI-based Problem Solving and Planning · Multimodal Machine Learning Applications · Multi-Agent Systems and Negotiation