Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents

Divyanshu Saxena; Rishikesh Maurya; Xiaoxuan Ou; Gagan Somashekar; Shachee Mishra Gupta; Arun Iyer; Yu Kang; Chetan Bansal; Aditya Akella; Saravan Rajmohan

arXiv:2511.10049 · cs.SE · November 14, 2025

TL;DR

This paper introduces a continuous benchmark generation method using large language models to evaluate enterprise-scale AI agents, adapting to evolving requirements and sparse ground-truth data.

Contribution

It presents a novel process for dynamically generating benchmarks from semi-structured documents, tailored for enterprise-scale AI agent evaluation.

Findings

01 Enables rapid feedback on agent performance

02 Supports evolving requirements with maintainable benchmarks

03 Facilitates targeted improvements in AI agents

Abstract

The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves using a fixed set of benchmarks and computing multiple evaluation metrics for the agent. While sufficient for simple coding tasks, these benchmarks fall short for enterprise-scale agents, where services and requirements evolve continuously and ground-truth examples are sparse. We propose a process of benchmark generation that helps evolve the benchmarks as the requirements change and perform robust evaluation of evolving AI agents. We instantiate this approach for a case study of service migration from one deployment platform to another at a large public enterprise. Our approach relies on semi-structured documents where developers express the high-level intent, and uses state-of-the-art LLMs…

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · AI-based Problem Solving and Planning