What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema
Mahdi Naser Moghadasi (BrightMind AI, Texas Tech University), Faezeh Ghaderi (University of Texas at Arlington)

TL;DR
This paper audits twelve LLM agent benchmark papers using a new schema, revealing significant gaps in disclosure about evaluation procedures and proposing a standardized scoring approach.
Contribution
It introduces a systematic audit schema for evaluating disclosure quality in LLM benchmark papers and applies it to twelve influential studies.
Findings
Average disclosure score for agent benchmarks is 0.38 out of 1.
None of the agent benchmark papers fully disclose inference cost.
No paper fully discloses the evaluation environment details.
Abstract
We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers report results on the same benchmark with the same model name and disagree, and you cannot tell whether the cause is the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer. This paper is an implementation report on the attempt. We designed a small audit schema (five fields: benchmark identity, harness specification, inference settings, cost reporting, failure breakdown), wrote a scoring codebook with the boundary cases we hit during pilot scoring, applied it to twelve canonical papers (eight agent, four classical static), and recorded what we saw. We score the disclosure of an agent run, not its…
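To make the five-field schema concrete, here is a minimal sketch of how one audit record and its overall score could be represented. The field names follow the abstract; the three-level scale (0, 0.5, 1), the class name `DisclosureAudit`, the helper `fully_disclosed`, and the unweighted mean are assumptions for illustration, not the authors' codebook.

```python
from dataclasses import dataclass, fields
from statistics import mean

# Assumed scale: 0 = not disclosed, 0.5 = partially, 1 = fully disclosed.
# The paper's codebook may define its levels differently.
ALLOWED_SCORES = (0.0, 0.5, 1.0)


@dataclass(frozen=True)
class DisclosureAudit:
    """One audited paper, scored on the five schema fields."""
    paper: str
    benchmark_identity: float
    harness_specification: float
    inference_settings: float
    cost_reporting: float
    failure_breakdown: float

    def __post_init__(self):
        # Reject any score outside the assumed scale.
        for name, value in self.scores().items():
            if value not in ALLOWED_SCORES:
                raise ValueError(f"{name}={value!r} is not in {ALLOWED_SCORES}")

    def scores(self):
        """Return the five field scores as a dict, excluding the paper label."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.name != "paper"
        }

    def overall(self):
        """Unweighted mean of the five fields (an assumed aggregation)."""
        return mean(list(self.scores().values()))


def fully_disclosed(audits, field_name):
    """List the papers that score 1.0 on the given field."""
    return [a.paper for a in audits if a.scores()[field_name] == 1.0]


# Placeholder record with made-up scores, not data from the audit.
example = DisclosureAudit(
    paper="hypothetical-paper",
    benchmark_identity=1.0,
    harness_specification=0.5,
    inference_settings=0.5,
    cost_reporting=0.0,
    failure_breakdown=0.0,
)
print(example.overall())
print(fully_disclosed([example], "cost_reporting"))
```

Keeping each field as a separate named score, rather than storing only the mean, lets a reader ask per-dimension questions such as which papers fully disclose inference cost.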
