RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

Pranav Atreya; Karl Pertsch; Tony Lee; Moo Jin Kim; Arhan Jain; Artur Kuramshin; Clemens Eppner; Cyrus Neary; Edward Hu; Fabio Ramos; Jonathan Tremblay; Kanav Arora; Kirsty Ellis; Luca Macesanu; Marcel Torne Villasevil; Matthew Leonard; Meedeum Cho; Ozgur Aslan; Shivin Dass; Jie Wang; William Reger; Xingfang Yuan; Xuning Yang; Abhishek Gupta; Dinesh Jayaraman; Glen Berseth; Kostas Daniilidis; Roberto Martin-Martin; Youngwoon Lee; Percy Liang; Chelsea Finn; Sergey Levine

arXiv:2506.18123 · cs.RO · December 2, 2025

TL;DR

RoboArena introduces a scalable, crowdsourced framework for evaluating generalist robot policies in real-world settings by aggregating pairwise preferences across diverse tasks and environments, improving ranking accuracy over traditional methods.

Contribution

The paper proposes a distributed evaluation approach that crowdsources policy assessments, enabling scalable, diverse, and unbiased benchmarking of generalist robot policies in real-world scenarios.

Findings

01. Crowdsourced evaluations outperform centralized methods in policy ranking accuracy.

02. Over 600 pairwise robot evaluations across seven policies demonstrate scalability and reliability.

03. The approach enhances transparency and community access to policy benchmarking.

Abstract

Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of…
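To make the idea of ranking policies from pairwise preferences concrete, here is a minimal sketch in Python. The text above does not say how RoboArena aggregates its comparisons, so the Bradley-Terry model used here is a standard stand-in chosen for illustration, not the paper's method. The function name `fit_bradley_terry`, the policy labels, and the comparison outcomes are all hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iters=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Assumes each pair names two distinct policies. Uses the classic
    minorization-maximization update and rescales the strengths after
    every pass so that they average to 1.
    """
    policies = sorted({p for pair in comparisons for p in pair})
    if not policies:
        return {}

    wins = defaultdict(float)   # total wins per policy
    games = defaultdict(float)  # comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0

    strength = {p: 1.0 for p in policies}
    for _ in range(n_iters):
        updated = {}
        for i in policies:
            # Sum over opponents j that policy i was actually compared with.
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in policies
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]

        # Rescale so the strengths average to 1 (the model is scale-invariant).
        total = sum(updated.values())
        updated = {p: v * len(policies) / total for p, v in updated.items()}

        delta = max(abs(updated[p] - strength[p]) for p in policies)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical double-blind outcomes: each tuple is (preferred, other).
comparisons = (
    [("policy_a", "policy_b")] * 3 + [("policy_b", "policy_a")]
    + [("policy_b", "policy_c")] * 3 + [("policy_c", "policy_b")]
    + [("policy_a", "policy_c")] * 2 + [("policy_c", "policy_a")]
)
strengths = fit_bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Because each evaluator may compare a different pair of policies on a different task, a model of this kind only needs the comparison graph to be connected; it does not require every policy to be run on the same tasks. A policy with no wins is driven toward zero strength, so a practical version would likely add some regularization.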
