EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
Sushant Mehta, Logan Ritchie, Suhaas Garre, Ian Niebres, Nick Heiner, Edwin Chen

TL;DR
Training AI agents in high-fidelity enterprise environments such as CoreCraft improves their ability to generalize across diverse, real-world tasks: the trained model's task pass rate rises from 25.37% to 36.76%, and the gains transfer to out-of-distribution benchmarks.
Contribution
Introduces CoreCraft, a realistic enterprise RL environment that simulates a customer support organization, and demonstrates that training in such environments improves the generalization of AI agents beyond the training tasks.
Findings
GLM 4.6, trained in CoreCraft with GRPO and adaptive clipping, improves its task pass rate from 25.37% to 36.76% after a single epoch.
The trained model shows transfer gains of +4.5%, +7.4%, and +6.8% on three out-of-distribution benchmarks.
Environment properties like diversity and realism are key to enabling generalization.
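The pass rates reported above use a strict criterion: per the abstract, a task counts as solved only when all expert-authored rubric criteria are satisfied. A minimal sketch of that scoring rule and of the arithmetic behind the reported gain follows. The function names and the example rubric outcomes are hypothetical and chosen for illustration; only the 25.37% and 36.76% figures come from the paper.

```python
from typing import Sequence


def task_passes(criteria_met: Sequence[bool]) -> bool:
    """A task passes only if it has criteria and every one is satisfied."""
    return len(criteria_met) > 0 and all(criteria_met)


def pass_rate(tasks: Sequence[Sequence[bool]]) -> float:
    """Percentage of tasks whose rubric criteria are all satisfied."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(task_passes(t) for t in tasks)
    return 100.0 * passed / len(tasks)


# Hypothetical rubric outcomes for four tasks (illustrative, not from the paper).
example_tasks = [
    [True, True, True],   # passes: all criteria met
    [True, False, True],  # fails: one criterion missed
    [True, True],         # passes
    [False],              # fails
]
example_rate = pass_rate(example_tasks)  # 2 of 4 tasks pass

# Reported pass rates before and after training (from the findings above).
before, after = 25.37, 36.76
absolute_gain = after - before                  # in percentage points
relative_gain = 100.0 * absolute_gain / before  # as a percent of the baseline
```

With the reported figures, the absolute gain is 11.39 percentage points, which is roughly a 45% relative improvement over the baseline pass rate.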
Abstract
We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out…
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
