Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

Shanshan Gao; Liyi Zhou

arXiv:2605.10448 · cs.AI · May 12, 2026

TL;DR

This paper introduces an outcome evidence reporting layer for interactive agent benchmarks. The layer makes explicit how well each reported success is supported by evidence and reports score bounds that reflect the uncertainty in outcome detection.

Contribution

It proposes a method to specify outcome verification artifacts, assign evidence labels, and report score bounds without altering existing benchmarks or agents.
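
To make the idea of reporting score bounds concrete, here is a minimal sketch. It assumes each run has already been given one of three evidence labels; the label names and the bound formulas are illustrative assumptions, not the paper's definitions. The lower bound counts only successes backed by evidence, and the upper bound additionally counts runs whose outcome could not be determined.

```python
from collections import Counter
from enum import Enum


class Evidence(Enum):
    # Hypothetical label names; the paper's actual label set may differ.
    SUPPORTED_SUCCESS = "supported_success"
    SUPPORTED_FAILURE = "supported_failure"
    UNDETERMINED = "undetermined"


def score_bounds(labels):
    """Return (lower, upper) bounds on the success rate.

    Lower bound: only runs whose success is backed by evidence count.
    Upper bound: undetermined runs are also counted as successes.
    """
    if not labels:
        raise ValueError("need at least one labeled run")
    counts = Counter(labels)
    n = len(labels)
    lower = counts[Evidence.SUPPORTED_SUCCESS] / n
    upper = (counts[Evidence.SUPPORTED_SUCCESS] + counts[Evidence.UNDETERMINED]) / n
    return lower, upper


# Illustrative data only: 6 supported successes, 3 undetermined, 1 supported failure.
labels = (
    [Evidence.SUPPORTED_SUCCESS] * 6
    + [Evidence.UNDETERMINED] * 3
    + [Evidence.SUPPORTED_FAILURE] * 1
)
lower, upper = score_bounds(labels)
print(f"success rate in [{lower:.2f}, {upper:.2f}]")  # success rate in [0.60, 0.90]
```

Under these assumptions, a single reported score of 0.90 would hide the fact that only 0.60 is backed by evidence; the width of the interval is the share of runs with an undetermined outcome.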

Findings

01. Applied to five public benchmarks, the layer reveals distinct failure modes.

02. Explicit evidence labels improve the transparency of success assessments.

03. The layer quantifies uncertainty in outcome detection, enhancing benchmark reliability.

Abstract

Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface-level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions.…
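
The shipping-address example in the abstract can be sketched in code. The snippet below is a hypothetical illustration, not taken from any benchmark: the record layout, the trace format, and the functions `surface_check` and `state_check` are all assumptions. It contrasts a check that looks only for a "Save" click with one that inspects the resulting state.

```python
import copy


def surface_check(trace):
    # Surface-level signal: passes if the agent clicked "Save" at any point.
    return any(step == ("click", "Save") for step in trace)


def state_check(before, after, customer, expected_address):
    # State-based check: the target record must hold the expected address,
    # and every other record must be unchanged.
    if after[customer]["shipping_address"] != expected_address:
        return False
    return all(after[name] == before[name] for name in before if name != customer)


before = {
    "Alice": {"shipping_address": "1 Old Road"},
    "Bob": {"shipping_address": "9 Elm Street"},
}

# Hypothetical run: the agent edits Bob's record instead of Alice's, then saves.
after = copy.deepcopy(before)
after["Bob"]["shipping_address"] = "42 New Street"
trace = [("open", "Bob"), ("type", "42 New Street"), ("click", "Save")]

surface_ok = surface_check(trace)
state_ok = state_check(before, after, "Alice", "42 New Street")
print(surface_ok, state_ok)  # True False
```

The surface check passes while the state check fails, which is the kind of run that would inflate a reported score if it were counted as a success.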
