Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

David Snyder; Apurva Badithela; Nikolai Matni; George Pappas; Anirudha Majumdar; Masha Itkina; Haruki Nishimura

arXiv:2603.13616 · cs.RO · March 17, 2026

Open Access

TL;DR

This paper introduces a sample-efficient, statistically rigorous framework for comparing robot policies. It handles a broad set of evaluation metrics, not only binary success, and reduces evaluation burden by up to 70% relative to standard methods.

Contribution

The paper presents a unified sequential testing procedure, based on safe, anytime-valid inference (SAVI), that applies to the diverse performance metrics used in robot policy evaluation.

Findings

01. Up to 70% reduction in evaluation burden compared to standard methods.

02. Up to 50% reduction in evaluation burden compared to existing binary-focused sequential procedures.

03. Faster differentiation between policies when using fine-grained task-progress metrics.

Abstract

Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable and efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice. Based on safe, anytime-valid inference (SAVI), our test procedure is sequential, allowing the evaluator to stop early when sufficient statistical evidence has accumulated to reach a decision at a pre-specified level of confidence. Unlike previous work developed for binary success, our unified approach addresses a wide range of…
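
To make the idea of sequential, anytime-valid testing concrete, here is a minimal sketch of a generic "testing by betting" procedure. It is not the paper's procedure, which this page does not spell out. The sketch assumes paired rollouts, per-rollout scores in [0, 1] (binary success or fractional task progress), a one-sided null hypothesis that policy A is no better than policy B on average, and a fixed bet fraction. The names `sequential_compare`, `bet`, and `alpha` are hypothetical.

```python
from typing import Iterable, Tuple


def sequential_compare(
    scores_a: Iterable[float],
    scores_b: Iterable[float],
    alpha: float = 0.05,
    bet: float = 0.5,
) -> Tuple[str, int, float]:
    """One-sided sequential test of H0: E[score_a - score_b] <= 0.

    Illustrative sketch only (assumed setup, not the paper's method).
    Scores must lie in [0, 1], so each paired difference lies in [-1, 1].
    With 0 < bet < 1, the wealth process below is a nonnegative
    supermartingale under H0, and by Ville's inequality it exceeds
    1 / alpha with probability at most alpha, at any stopping time.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if not 0.0 < bet < 1.0:
        raise ValueError("bet must be in (0, 1)")

    threshold = 1.0 / alpha
    wealth = 1.0
    n = 0
    for n, (a, b) in enumerate(zip(scores_a, scores_b), start=1):
        if not (0.0 <= a <= 1.0 and 0.0 <= b <= 1.0):
            raise ValueError("scores must lie in [0, 1]")
        # Bet a fixed fraction of current wealth on A outscoring B.
        wealth *= 1.0 + bet * (a - b)
        if wealth >= threshold:
            # Enough evidence has accumulated: stop early.
            return "A better", n, wealth
    # Rollout budget exhausted without reaching the threshold.
    return "undecided", n, wealth


# Example: fractional task-progress scores for two policies.
decision, rollouts_used, final_wealth = sequential_compare(
    [0.75] * 30, [0.25] * 30, alpha=0.05
)
print(decision, rollouts_used)
```

In this sketch the evaluator can check the decision after every paired rollout and stop as soon as the wealth crosses `1 / alpha`, which is the early-stopping behavior the abstract describes. A fixed bet fraction is the simplest valid choice; adaptive betting strategies typically stop sooner but are beyond this illustration.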

Taxonomy

Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Adversarial Robustness in Machine Learning