Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
Shanshan Gao, Liyi Zhou

TL;DR
This paper introduces an evidence-reporting layer for interactive-agent benchmarks that quantifies uncertainty in outcome detection and makes failure modes explicit, improving the reliability of success evaluations.
Contribution
It proposes a method to specify outcome verification artifacts, assign evidence labels, and report score bounds without altering existing benchmarks or agents.
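One way to picture how evidence labels could turn into score bounds is the minimal Python sketch below. Everything in it is an assumption made for illustration: the three label names, the `Run` record, and the `score_bounds` function are hypothetical and do not come from the paper, whose exact label set and bound definitions are not given in this summary. The sketch uses plain worst-case bounds: the lower bound counts only runs whose success is confirmed by evidence, and the upper bound also counts runs whose outcome the evidence cannot settle.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    # Hypothetical label set, chosen for illustration only.
    CONFIRMED_SUCCESS = "confirmed_success"  # an artifact shows the intended state change
    CONFIRMED_FAILURE = "confirmed_failure"  # an artifact shows the change did not happen
    UNDETERMINED = "undetermined"            # available artifacts cannot settle the outcome


@dataclass(frozen=True)
class Run:
    task_id: str
    reported_success: bool  # the benchmark's own binary outcome, left untouched
    evidence: Evidence


def score_bounds(runs):
    """Return (lower, reported, upper) success rates over a non-empty list of runs."""
    if not runs:
        raise ValueError("score_bounds needs at least one run")
    n = len(runs)
    confirmed = sum(r.evidence is Evidence.CONFIRMED_SUCCESS for r in runs)
    undetermined = sum(r.evidence is Evidence.UNDETERMINED for r in runs)
    reported = sum(r.reported_success for r in runs)
    # Lower bound: every undetermined run is treated as a failure.
    # Upper bound: every undetermined run is treated as a success.
    return confirmed / n, reported / n, (confirmed + undetermined) / n


runs = [
    Run("t1", True, Evidence.CONFIRMED_SUCCESS),
    Run("t2", True, Evidence.UNDETERMINED),
    Run("t3", True, Evidence.CONFIRMED_FAILURE),  # reported as a pass, contradicted by evidence
    Run("t4", False, Evidence.CONFIRMED_FAILURE),
]
lower, reported, upper = score_bounds(runs)
print(f"reported={reported:.2f}, evidence-supported range=[{lower:.2f}, {upper:.2f}]")
```

In this toy data the benchmark's reported score (0.75) falls outside the evidence-supported range of 0.25 to 0.50, because one run that was reported as a success is contradicted by its evidence. The benchmark's own outcomes are never rewritten; the bounds are reported alongside them.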
Findings
Applied to five public benchmarks, the layer reveals distinct failure modes.
Explicit evidence labels improve the transparency of success assessments.
The layer quantifies uncertainty in outcome detection, enhancing benchmark reliability.
Abstract
Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface-level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions.…
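The shipping-address example in the abstract can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the paper's evaluator: the event format, the `surface_check` and `state_check` functions, and the toy address table are all hypothetical. It contrasts a check that only looks for a "Save" click with one that inspects the resulting state.

```python
def surface_check(events):
    """Pass if the agent clicked "Save" at any point (a surface-level signal)."""
    return any(e.get("action") == "click" and e.get("target") == "Save" for e in events)


def state_check(before, after, customer, expected_address):
    """Pass only if the target record holds the new address and no other record changed."""
    target_ok = after.get(customer) == expected_address
    same_customers = set(before) == set(after)
    others_unchanged = all(after.get(k) == v for k, v in before.items() if k != customer)
    return target_ok and same_customers and others_unchanged


# Hypothetical task: change Alice's shipping address to "42 New Avenue".
before = {"alice": "1 Old Road", "bob": "9 Elm Street"}

# A run in which the agent edits Bob's record by mistake, then clicks "Save".
events = [
    {"action": "type", "target": "address", "value": "42 New Avenue"},
    {"action": "click", "target": "Save"},
]
after_wrong_record = {"alice": "1 Old Road", "bob": "42 New Avenue"}

# What the state would look like had the agent edited the right record.
after_correct = {"alice": "42 New Avenue", "bob": "9 Elm Street"}

print("surface check:", surface_check(events))
print("state check (wrong record):", state_check(before, after_wrong_record, "alice", "42 New Avenue"))
print("state check (correct record):", state_check(before, after_correct, "alice", "42 New Avenue"))
```

The surface check passes for the mistaken run because "Save" was clicked, while the state check fails because Alice's record is unchanged and Bob's was modified. A score built on the surface check alone would count this run as a success.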
