BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
Xinming Tu, Tianze Wang, Yingzhou (Minta) Lu, Kexin Huang, Yuanhao Qu, and Sara Mostafavi

TL;DR
BenchGuard is an automated framework that uses frontier LLMs to audit complex agent benchmarks, uncovering issues missed by human review and improving evaluation reliability.
Contribution
This work presents the first automated auditing framework for task-oriented, execution-based agent benchmarks, using frontier LLMs to systematically cross-verify benchmark artifacts and identify defects.
Findings
Identified 12 author-confirmed issues in ScienceAgentBench, including fatal errors that render tasks unsolvable.
Exactly matched 83.3% of expert-identified issues on BIXBench Verified-50 and caught defects missed by human review.
A full audit of 50 bioinformatics tasks costs under USD 15.
Abstract
As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard cross-verifies all benchmark artifacts via structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench - including fatal errors rendering tasks unsolvable - and exactly matched 83.3% of expert-identified issues on the…
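The cross-verification idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `TaskArtifacts` fields, the pairwise comparison strategy, the prompt wording, and the `ask` callable (a stand-in for any LLM call) are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional, Tuple


@dataclass
class TaskArtifacts:
    """Hypothetical container; field names are assumptions, not BenchGuard's schema."""
    task_id: str
    instruction: str
    gold_program: str
    eval_script: str
    agent_trace: Optional[str] = None  # optional extra diagnostic evidence


@dataclass
class Finding:
    task_id: str
    artifacts: Tuple[str, str]  # names of the two artifacts that disagree
    note: str


def audit_task(task: TaskArtifacts,
               ask: Callable[[str], Optional[str]]) -> List[Finding]:
    """Compare every pair of artifacts and collect the inconsistencies reported.

    `ask` is a hypothetical auditor: it receives a prompt and returns a short
    note when it sees an inconsistency, or None when the pair looks consistent.
    """
    named = {
        "instruction": task.instruction,
        "gold_program": task.gold_program,
        "eval_script": task.eval_script,
    }
    if task.agent_trace is not None:
        named["agent_trace"] = task.agent_trace

    findings = []
    for a, b in combinations(named, 2):
        prompt = (
            f"Do these two artifacts of task {task.task_id} agree?\n"
            f"[{a}]\n{named[a]}\n[{b}]\n{named[b]}"
        )
        note = ask(prompt)
        if note is not None:
            findings.append(Finding(task.task_id, (a, b), note))
    return findings


def stub_ask(prompt: str) -> Optional[str]:
    """Toy auditor used only to make the sketch runnable without an LLM."""
    text = prompt.lower()
    if "accuracy" in text and "f1" in text:
        return "metric mismatch: accuracy vs f1"
    return None


example = TaskArtifacts(
    task_id="t1",
    instruction="Report accuracy on the held-out split.",
    gold_program="print(accuracy)",
    eval_script="compare f1 against the reference value",
)
example_findings = audit_task(example, stub_ask)
```

In a real audit, `ask` would wrap a frontier model with a structured prompt and response format; the toy `stub_ask` above only demonstrates the control flow, in which an evaluation script that scores a different metric than the instruction requests is flagged against each artifact it contradicts.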
