EnterpriseBench CoreCraft: Training Generalizable Agents on High-Fidelity RL Environments

Sushant Mehta; Logan Ritchie; Suhaas Garre; Ian Niebres; Nick Heiner; Edwin Chen

arXiv:2602.16179 · cs.AI · March 3, 2026

TL;DR

Training AI agents in high-fidelity enterprise environments such as CoreCraft improves their ability to generalize beyond the training tasks: after a single epoch of GRPO training, GLM 4.6 raises its task pass rate from 25.37% to 36.76%, with transfer gains of +4.5%, +7.4%, and +6.8% on out-of-distribution benchmarks.

Contribution

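The paper introduces CoreCraft, a high-fidelity enterprise RL environment that simulates a customer support organization (over 2,500 entities across 14 entity types, with 23 unique tools), and demonstrates that training in such an environment improves agents' generalization beyond the training tasks.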

Findings

01

GLM 4.6, trained in CoreCraft with GRPO and adaptive clipping for a single epoch, improves its task pass rate from 25.37% to 36.76%.

02

Transfer gains of +4.5%, +7.4%, and +6.8% on out-of-distribution benchmarks.

03

Environment properties like diversity and realism are key to enabling generalization.

Abstract

We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out…
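The abstract leans on two mechanisms that a small example may make concrete: a task counts as passed only when every rubric criterion is satisfied, and GRPO scores each rollout relative to the other rollouts in its group. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the boolean rubric representation, and the sample data are all hypothetical, and the group-relative advantage shown is the commonly described GRPO normalization. The adaptive clipping mentioned in the abstract is not specified in this excerpt and is not modeled here.

```python
from statistics import mean, pstdev


def task_passes(criteria_results):
    """Strict rubric scoring: a task passes only if it has at least one
    criterion and every criterion is satisfied (assumed representation)."""
    return bool(criteria_results) and all(criteria_results)


def pass_rate(tasks):
    """Fraction of tasks whose rubric criteria are all satisfied."""
    return sum(task_passes(t) for t in tasks) / len(tasks)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward within its group (same task).

    This is the usual GRPO-style advantage; eps guards against a zero
    standard deviation when all rollouts receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric outcomes for three tasks (one list per task).
tasks = [
    [True, True, True],   # all criteria met: passes
    [True, False, True],  # one criterion missed: fails
    [True, True],         # all criteria met: passes
]
rate = pass_rate(tasks)

# Hypothetical binary rewards for four rollouts of a single task.
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rollout_rewards)

print(f"pass rate: {rate:.4f}")
print("advantages:", advantages)
```

Under all-or-nothing scoring, a rollout that meets most criteria but misses one earns the same pass/fail outcome as one that meets none, which is consistent with the low frontier-model pass rates the abstract reports. When every rollout in a group earns the same reward, the advantages in this sketch are all zero, so that group contributes no learning signal.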

Taxonomy

Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)