Establishing Best Practices for Building Rigorous Agentic Benchmarks

Yuxuan Zhu; Tengjun Jin; Yada Pruksachatkun; Andy Zhang; Shu Liu; Sasha Cui; Sayash Kapoor; Shayne Longpre; Kevin Meng; Rebecca Weiss; Fazl Barez; Rahul Gupta; Jwala Dhamala; Jacob Merizian; Mario Giulianelli; Harry Coppock; Cozmin Ududec; Jasjeet Sekhon; Jacob Steinhardt; Antony Kellermann; Sarah Schwettmann; Matei Zaharia; Ion Stoica; Percy Liang; Daniel Kang

arXiv:2507.02825 · cs.AI · August 8, 2025


TL;DR

This paper identifies task-setup and reward-design issues in current agentic benchmarks and introduces the Agentic Benchmark Checklist (ABC) to make them more rigorous. Applied to CVE-Bench, ABC reduces performance overestimation by 33%.

Contribution

The paper presents the Agentic Benchmark Checklist (ABC), a set of guidelines to improve the design and evaluation of agentic benchmarks in AI.

Findings

01. ABC reduces performance overestimation by 33% in CVE-Bench.

02. Many existing benchmarks have issues such as insufficient test cases and miscounted successes.

03. Applying ABC improves the reliability of agent evaluations.

Abstract

Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to…
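To make the phrase "in relative terms" concrete, the following is a minimal sketch of one common way to compute relative overestimation: the gap between a reported score and a corrected score, divided by the corrected score. The function name and the example numbers are hypothetical and are not taken from the paper; the paper's exact definition may differ.

```python
def relative_overestimation(reported: float, corrected: float) -> float:
    """Return (reported - corrected) / corrected.

    Assumption: 'relative' means relative to the corrected score.
    A positive result is an overestimation, a negative one an underestimation.
    """
    if corrected <= 0:
        raise ValueError("corrected score must be positive")
    return (reported - corrected) / corrected


# Hypothetical numbers, chosen only for illustration: a benchmark reports a
# 40% success rate, but a stricter evaluation yields 20%.
reported_rate = 0.40
corrected_rate = 0.20
print(f"{relative_overestimation(reported_rate, corrected_rate):.0%}")
```

Under this reading, a reported success rate that is double the corrected rate corresponds to a 100% relative overestimation, even though the absolute gap is only 20 percentage points.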
