RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
Yash Jangir, Yidi Zhang, Pang-Chi Lo, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, and Katerina Fragkiadaki

TL;DR
RobotArena Infinity offers a scalable, reproducible benchmarking framework for robot policies by translating real-world demonstrations into simulated environments with human feedback, enabling large-scale evaluation of generalist robots.
Contribution
It introduces a novel simulation-based benchmarking approach that leverages vision-language models and human preferences to evaluate robot policies at scale.
Findings
Automated conversion of video demonstrations into simulated environments.
Scalable human preference collection for policy evaluation.
Robustness testing through environment perturbations.
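This summary does not say how the collected human preferences are aggregated into a policy ranking. One common option for pairwise preference data is a Bradley-Terry model. The sketch below illustrates that option only and should not be read as the paper's method. The function name `bradley_terry`, the `(winner, loser)` input format, and the policy names are hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard minorization-maximization update
        p_i <- W_i / sum_j n_ij / (p_i + p_j)
    where W_i is the number of wins of i and n_ij the number of
    comparisons between i and j. Strengths are normalized to sum to 1.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    policies = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        policies.update((winner, loser))

    strength = {p: 1.0 / len(policies) for p in policies}
    for _ in range(iters):
        updated = {}
        for p in policies:
            denom = 0.0
            for q in policies:
                if q == p:
                    continue
                n = pair_counts.get(frozenset((p, q)), 0.0)
                if n:
                    denom += n / (strength[p] + strength[q])
            # A policy that was never compared keeps its current strength.
            updated[p] = wins[p] / denom if denom else strength[p]
        total = sum(updated.values())
        updated = {p: v / total for p, v in updated.items()}
        delta = max(abs(updated[p] - strength[p]) for p in policies)
        strength = updated
        if delta < tol:
            break
    return strength


def rank(comparisons):
    """Return policy names sorted from strongest to weakest."""
    strength = bradley_terry(comparisons)
    return sorted(strength, key=strength.get, reverse=True)
```

Each human judgment of the form "rollout from policy X looked better than rollout from policy Y" would contribute one `(X, Y)` pair. A policy with no wins at all is driven to a strength of zero under this update, so a real deployment would likely add a prior or smoothing term.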
Abstract
The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. We introduce RobotArena Infinity, a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into…
Peer Reviews
Decision·ICLR 2026 Poster
- The pipeline that transforms real-world robot demonstration videos into simulated environments is very nice and generally useful.
- The study conducts, according to the authors, the most extensive evaluation of generalist robot policies to date.
The real-to-sim pipeline is very nice. But, for example, how is it better than testing VLAs on a bunch of different simulation environments? That would also make sure to include some out-of-distribution domains, wouldn't it? And why is this work framed as a policy evaluation work? It looks like a real-to-sim method, and it deserves credit for that contribution (more for it than for policy evaluation, because real-to-sim is more general). I think the paper has good potential, but the following impr
1. The automated real-to-sim translation pipeline that the paper introduces is innovative.
2. The hybrid assessment method integrating VLM-based scoring with human preference feedback is comprehensive.
1. The simulation environment cannot accurately reproduce fine-grained physical interactions (e.g., plug insertion, deformable object manipulation), limiting evaluation fidelity for precision tasks.
2. The multi-stage pipeline may accumulate errors, but the paper lacks quantitative analysis of error propagation across stages.
3. The benchmark primarily focuses on static-camera, table-top manipulation tasks from datasets like Bridge and DROID, lacking coverage of dynamic scenarios, mobile navigat
1. The paper introduces a comprehensive and automated pipeline that seamlessly bridges real-world robot data and simulation, enabling scalable and reproducible evaluation of vision-language-action models.
2. It conducts the largest cross-lab evaluation to date, providing unprecedented insights into the generalization capabilities and limitations of current generalist robot policies under diverse distribution shifts.
3. The paper is well-written, presenting a complex technical system with concept
My main concerns lie in the proposed real2sim pipeline, including:
1. The 3D assets come primarily from 3D generation models, which may produce meshes whose shapes differ from the real objects or that have implausible collisions. Moreover, the physical parameters are supplied by LLMs, which can also result in implausible physical motion in simulation.
2. Inpainting the background makes the camera viewpoint in this evaluation process remain the same as, or close to, that in the original video
