Rollout Cards: A Reproducibility Standard for Agent Research
Charlie Masters, Ziyuan Liu, Stefano V. Albrecht

TL;DR
This paper introduces rollout cards as a reproducibility standard for agent research, emphasizing the importance of preserving rollout records to ensure transparent and consistent evaluation of agent behaviors.
Contribution
It proposes rollout cards to standardize the reporting of rollout records, improving reproducibility and transparency in agent research evaluations.
Findings
None of the 50 audited repositories report how many runs failed, errored, or were skipped alongside headline scores.
Changing the reporting rule can alter success rates by over 20%.
Regrading with different rules can invert model rankings.
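The findings above can be illustrated with a toy calculation. The following is a minimal sketch, not the paper's method: the outcome data, the model names, and the two reporting rules (count errored runs as failures, or drop them from the denominator) are all invented for illustration. It shows how the same fixed rollout outcomes can yield different success rates, and a different ranking, depending on which rule is applied.

```python
from collections import Counter

# Hypothetical outcomes for two models on the same 10 tasks (invented data).
# "pass"/"fail" are graded runs; "error" means the run crashed before grading.
ROLLOUTS = {
    "model_a": ["pass"] * 6 + ["fail"] * 4,
    "model_b": ["pass"] * 5 + ["fail"] * 1 + ["error"] * 4,
}


def success_rate(outcomes, count_errors_as_failures):
    """Success rate under one of two illustrative reporting rules."""
    counts = Counter(outcomes)
    if count_errors_as_failures:
        denominator = len(outcomes)  # errored runs stay in the denominator
    else:
        denominator = len(outcomes) - counts["error"]  # errored runs are dropped
    return counts["pass"] / denominator if denominator else float("nan")


def rank(rollouts, count_errors_as_failures):
    """Model names sorted from highest to lowest success rate."""
    return sorted(
        rollouts,
        key=lambda name: success_rate(rollouts[name], count_errors_as_failures),
        reverse=True,
    )


if __name__ == "__main__":
    for strict in (True, False):
        rule = "errors count as failures" if strict else "errors excluded"
        rates = {m: round(success_rate(o, strict), 3) for m, o in ROLLOUTS.items()}
        print(rule, rates, rank(ROLLOUTS, strict))
```

In this toy data, model_a scores 0.6 under both rules, while model_b moves from 0.5 to roughly 0.833 once errored runs are excluded, so the ranking flips. Publishing the per-status counts next to the headline score is what lets a reader recompute either number.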
Abstract
Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout…
