World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems

Lakshya Gupta; Litao Li; Yizhe Liu; Sriram Ganapathi Subramanian; Kaheer Suleman; Zichen Zhang; Haoye Lu; Sumit Pasupalak

arXiv:2601.22130 · cs.AI · February 12, 2026

Open Access

TL;DR

This paper introduces World of Workflows (WoW), a benchmark environment for testing large language models in complex enterprise systems with hidden workflows, revealing their limitations and emphasizing the need for grounded world modeling.

Contribution

The paper presents a realistic ServiceNow-based enterprise environment with 4,000+ business rules and 55 active workflows, and highlights the importance of modeling system dynamics for building reliable enterprise agents.

Findings

01

LLMs fail to predict cascading side effects in enterprise workflows.

02

Reliability requires agents to simulate hidden state transitions.

03

WoW provides a new environment for evaluating enterprise system understanding.

Abstract

Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Business Process Modeling and Analysis · Software System Performance and Reliability