Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents
Divyanshu Saxena, Rishikesh Maurya, Xiaoxuan Ou, Gagan Somashekar, Shachee Mishra Gupta, Arun Iyer, Yu Kang, Chetan Bansal, Aditya Akella, Saravan Rajmohan

TL;DR
This paper introduces a method that uses large language models to continuously generate benchmarks for evaluating enterprise-scale AI agents, so that the benchmarks keep pace with evolving requirements even when ground-truth data is sparse.
Contribution
It presents a novel process for dynamically generating benchmarks from semi-structured documents, tailored for enterprise-scale AI agent evaluation.
Findings
Enables rapid feedback on agent performance
Supports evolving requirements with maintainable benchmarks
Facilitates targeted improvements in AI agents
Abstract
The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves using a fixed set of benchmarks and computing multiple evaluation metrics for the agent. While sufficient for simple coding tasks, these benchmarks fall short for enterprise-scale agents, where services and requirements evolve continuously and ground-truth examples are sparse. We propose a process of benchmark generation that helps evolve the benchmarks as the requirements change and perform robust evaluation of evolving AI agents. We instantiate this approach for a case study of service migration from one deployment platform to another at a large public enterprise. Our approach relies on semi-structured documents where developers express the high-level intent, and uses state-of-the-art LLMs…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · AI-based Problem Solving and Planning
