AlphaEval: Evaluating Agents in Production

Pengrui Lu; Bingyu Xu; Wenjun Zhang; Shengjia Hua; Xuanjian Gao; Ranxiang Ge; Lyumanshan Ye; Linxuan Wu; Yiran Li; Junfei Fish Yu; Yibo Zhang; Ruixin Li; Manxiang Li; Xiao Han; Xiaocong Zhou; Guangyao Chi; Zisheng Chen; Kaishen Chen; Kun Wang; Qihua Xu; Fengyue Meng; Yuchen Ni; Jiajun Li; Jinxiu Liu; Danfeng Zhang; Jingru Zhao; Pengfei Liu

arXiv:2604.12162 · cs.CL · April 15, 2026

TL;DR

AlphaEval is a production-grounded benchmark of 94 tasks sourced from seven companies. It evaluates complete AI agents across diverse real-world scenarios and establishes a systematic methodology for building such benchmarks.

Contribution

The paper contributes a production-grounded benchmark and a systematic framework for efficiently turning real-world requirements into evaluation tasks.

Findings

01. Evaluates complete AI agents as commercial systems, revealing performance variations unseen in model-centric benchmarks.

02. Includes diverse evaluation paradigms such as LLM-as-a-Judge, formal verification, and UI testing.

03. Provides a reproducible methodology for organizations to create their own production-relevant benchmarks.

Abstract

The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics -- conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric…
