Evaluating Gemini Robotics Policies in a Veo World Simulator

Gemini Robotics Team; Krzysztof Choromanski; Coline Devin; Yilun Du; Debidatta Dwibedi; Ruiqi Gao; Abhishek Jindal; Thomas Kipf; Sean Kirmani; Isabel Leal; Fangchen Liu; Anirudha Majumdar; Andrew Marmon; Carolina Parada; Yulia Rubanova; Dhruv Shah; Vikas Sindhwani; Jie Tan; Fei Xia; Ted Xiao; Sherry Yang; Wenhao Yu; Allan Zhou

arXiv:2512.10675·cs.RO·January 7, 2026

Open Access

TL;DR

This paper introduces a generative evaluation system using a frontier video model (Veo) to assess robot policies across various scenarios, including out-of-distribution conditions, by synthesizing realistic scene variations.

Contribution

The paper presents a novel system leveraging a frontier video foundation model for comprehensive robot policy evaluation, including safety and generalization testing, in simulation and real-world settings.

Findings

01. Accurately predicts policy performance in nominal and OOD scenarios.

02. Enables safety probing and policy red teaming.

03. Supports realistic scene editing and multi-view consistency.

Abstract

Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view…
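
To make the idea of evaluating a policy inside a learned world model concrete, here is a minimal sketch of a closed-loop rollout. It is not the system described in the report: the names `rollout`, `estimate_success_rate`, `toy_policy`, `toy_world_model`, and `toy_success` are hypothetical, and the one-dimensional "observation" stands in for the video frames that an action-conditioned video model would generate. The only point it illustrates is the loop structure: the policy sees a predicted observation, emits an action, and the world model predicts the next observation.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in types: a real system would use image frames
# and robot action vectors instead of short lists of floats.
Observation = List[float]
Action = List[float]


def rollout(
    policy: Callable[[Observation], Action],
    world_model: Callable[[Observation, Action], Observation],
    is_success: Callable[[Observation], bool],
    initial_obs: Observation,
    horizon: int,
) -> bool:
    """Run one closed-loop episode entirely inside the world model."""
    obs = initial_obs
    for _ in range(horizon):
        action = policy(obs)            # policy acts on the predicted observation
        obs = world_model(obs, action)  # model predicts the next observation
        if is_success(obs):
            return True
    return False


def estimate_success_rate(
    policy: Callable[[Observation], Action],
    world_model: Callable[[Observation, Action], Observation],
    is_success: Callable[[Observation], bool],
    initial_observations: Sequence[Observation],
    horizon: int = 20,
) -> float:
    """Average success over a set of starting scenes (nominal or edited)."""
    outcomes = [
        rollout(policy, world_model, is_success, obs0, horizon)
        for obs0 in initial_observations
    ]
    return sum(outcomes) / len(outcomes)


# Toy stand-ins (assumptions for illustration only).
def toy_policy(obs: Observation) -> Action:
    # Proportional controller that moves toward a goal position of 1.0.
    return [0.5 * (1.0 - obs[0])]


def toy_world_model(obs: Observation, action: Action) -> Observation:
    # Trivial dynamics: the next position is the current position plus the action.
    return [obs[0] + action[0]]


def toy_success(obs: Observation) -> bool:
    return abs(obs[0] - 1.0) < 0.05


if __name__ == "__main__":
    starts = [[0.0], [0.5], [2.0]]
    rate = estimate_success_rate(toy_policy, toy_world_model, toy_success, starts)
    print(f"estimated success rate: {rate:.2f}")
```

In this framing, an out-of-distribution or safety probe would amount to changing the set of initial observations (for example, edited scenes) while keeping the rollout loop unchanged.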

Taxonomy

Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Social Robot Interaction and HRI