Rollout Cards: A Reproducibility Standard for Agent Research

Charlie Masters; Ziyuan Liu; Stefano V. Albrecht

arXiv:2605.12131 · cs.AI · May 13, 2026

TL;DR

This paper introduces rollout cards, a reproducibility standard for agent research that treats preserved rollout records, rather than reported scores, as the unit of reproducibility, so that agent behaviour can be evaluated transparently and consistently.

Contribution

The paper proposes rollout cards to standardize how rollout records are reported, making agent evaluations easier to reproduce and inspect.

Findings

01

None of the 50 audited repositories report how many runs failed, errored, or were skipped alongside headline scores.

02

Changing the reporting rule can alter success rates by over 20%.

03

Regrading the same rollouts under different rules can invert model rankings.

Abstract

Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout…
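To make the reporting-rule effect concrete, here is a minimal Python sketch of how a fixed set of rollout outcomes can yield different success rates, and even a different model ranking, depending on whether errored and skipped runs are counted. The model names, outcome counts, status labels, and rule names are all hypothetical and chosen for illustration; they are not taken from the paper's audit and do not represent the rollout card format.

```python
from collections import Counter

# Hypothetical outcome counts for a fixed set of rollouts (illustrative only).
ROLLOUTS = {
    "model_a": Counter(success=6, failure=1, error=3, skipped=0),
    "model_b": Counter(success=7, failure=3, error=0, skipped=0),
}


def success_rate(counts, rule):
    """Success rate for one model under a named (hypothetical) reporting rule."""
    if rule == "completed_only":
        # Errored and skipped runs are silently dropped from the denominator.
        denom = counts["success"] + counts["failure"]
    elif rule == "all_scheduled":
        # Every scheduled run counts; errors and skips count against the model.
        denom = sum(counts[k] for k in ("success", "failure", "error", "skipped"))
    else:
        raise ValueError(f"unknown reporting rule: {rule}")
    return counts["success"] / denom if denom else float("nan")


def ranking(rule):
    """Models sorted from best to worst under the given reporting rule."""
    return sorted(
        ROLLOUTS,
        key=lambda model: success_rate(ROLLOUTS[model], rule),
        reverse=True,
    )


if __name__ == "__main__":
    for rule in ("completed_only", "all_scheduled"):
        print(rule)
        for model in ranking(rule):
            counts = ROLLOUTS[model]
            # Report the excluded-run counts next to the headline score.
            print(
                f"  {model}: {success_rate(counts, rule):.3f} "
                f"(error={counts['error']}, skipped={counts['skipped']})"
            )
```

With these made-up counts, `model_a` ranks first when errored runs are dropped and second when they are counted, although the underlying rollouts are identical. Printing the error and skip counts beside each score is what lets a reader see which rule was applied.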
