Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Yue Liao; Pengfei Zhou; Siyuan Huang; Donglin Yang; Shengcong Chen; Yuxin Jiang; Yue Hu; Jingbin Cai; Si Liu; Jianlan Luo; Liliang Chen; Shuicheng Yan; Maoqing Yao; Guanghui Ren

arXiv:2508.05635 · cs.RO · November 5, 2025

TL;DR

Genie Envisioner is a comprehensive platform for robotic manipulation that combines a video diffusion model, a policy decoder, a neural simulator, and a benchmark suite to enable scalable, instruction-driven embodied intelligence.

Contribution

It introduces a unified framework integrating policy learning, evaluation, and simulation in a single video-generative platform for robotic manipulation.

Findings

01. High-fidelity, instruction-conditioned video generation of robotic interactions
02. Generalizable policy inference across diverse robotic embodiments
03. A scalable neural simulator for closed-loop policy development

Abstract

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized…
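As a rough illustration of what a flow-matching decoder does at inference time, the sketch below integrates a velocity field from Gaussian noise to an action vector with fixed Euler steps. It is a toy under stated assumptions, not the GE-Act implementation: `toy_velocity`, `sample_actions`, and the 3-dimensional `target_action` are hypothetical names, and the closed-form velocity stands in for a learned network that would be conditioned on the video model's latent features.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1 toward a single
    target x_1, the velocity at (x, t) is (x_1 - x) / (1 - t). A real decoder
    would predict this from conditioning features instead of knowing x_1.
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]

def sample_actions(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Assumed 3-dimensional action that the conditioning is taken to point to.
target_action = [0.2, -0.5, 1.0]
actions = sample_actions(target_action)
```

With this particular velocity field the final Euler step lands on the target up to floating-point error, which is why a small number of integration steps can be enough for a simple, well-conditioned action distribution.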

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The multi-stage training, especially with varying frequencies, is a clever idea to make the learned video prediction more robust to different execution speeds. The FiLM-style injection of the latents into the policy architecture, rather than using only the last-layer output, is a nice way to ensure that multi-layered information is used more effectively for downstream policy learning. The experiments are thorough and of high quality. The paper is written clearly and u
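For readers unfamiliar with the FiLM (feature-wise linear modulation) idea the reviewer mentions, here is a minimal sketch: a conditioning vector is projected to a per-channel scale `gamma` and shift `beta`, which then modulate the policy features. The function names, weight matrices, and dimensions are hypothetical and chosen only for illustration; the review does not specify how the paper wires this into its policy network.

```python
def linear(weights, bias, x):
    """Plain dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(cond) * features + beta(cond)."""
    gamma = linear(w_gamma, b_gamma, cond)  # per-channel scale
    beta = linear(w_beta, b_beta, cond)     # per-channel shift
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

# Hypothetical sizes: 2 policy feature channels, 3-dimensional latent.
features = [1.0, 2.0]
cond = [0.5, -1.0, 2.0]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_gamma = [1.0, 1.0]
w_beta = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
b_beta = [0.0, 0.5]

modulated = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
# gamma = [1.5, 0.0], beta = [2.0, 0.5], so modulated = [3.5, 0.5]
```

Injecting latents at several depths would amount to calling `film` once per policy layer, each with its own projection weights and the latent taken from the matching layer of the video model.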

Weaknesses

- I did not fully understand what exactly the multi-view consistency is. Is it just the joint prediction of head, left and right video feeds? Or is it in the structure of the attention layers used in the model?
- The approach requires few-shot fine-tuning for any new robotic embodiment. While this is a problem in general with VLA-style policy networks, I would imagine that the video generation model, if trained on a variety of robotic embodiments, such as from Open X-Embodiment or similar datasets

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper demonstrates strong engineering efforts, including large-scale data and model training.
2. The experimental performance is strong.

Weaknesses

1. The macro-level architecture design, an action model built on top of world model representations, was, to my knowledge, first introduced in the GR-1 and GR-2 series. However, these prior works are not cited or discussed, which weakens the contextual positioning of Genie Envisioner within the literature.
2. The paper’s presentation resembles a technical report more than a polished conference paper. Although it comprehensively covers implementation aspects, it fails to emphasize the key innovatio

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Well-presented paper that provides comprehensive details and well-illustrated figures, enabling readers to fully understand the task.
- GE-Base supports multi-view observation, which is quite helpful for robot planning tasks.
- I like the idea of asynchronous inference, which provides a practical solution for incorporating the world model and policy.
- The video results are impressive, and the extensive experiments prove the effectiveness of GE.
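One common reading of asynchronous inference is that an expensive world model refreshes its latent only occasionally while a cheap policy head acts at every control step on the cached latent. The sketch below simulates those two update rates in a single thread; it is not true concurrency and not the paper's pipeline. `run_async`, `slow_world_model`, `fast_policy`, and the refresh interval of 3 are all hypothetical.

```python
def run_async(num_steps, refresh_every, encode, decode):
    """Call `encode` every `refresh_every` steps and `decode` at every step."""
    cached = None
    log = []
    for step in range(num_steps):
        if step % refresh_every == 0:
            cached = encode(step)  # slow path: refresh the latent
        log.append(decode(cached, step))  # fast path: act on cached latent
    return log

encoder_calls = []

def slow_world_model(step):
    """Hypothetical expensive model; records when it was invoked."""
    encoder_calls.append(step)
    return {"latent_from_step": step}

def fast_policy(latent, step):
    """Hypothetical cheap head; reports which latent it acted on."""
    return (step, latent["latent_from_step"])

trace = run_async(6, 3, slow_world_model, fast_policy)
# trace == [(0, 0), (1, 0), (2, 0), (3, 3), (4, 3), (5, 3)]
```

The trade-off in this kind of scheme is staleness: between refreshes the policy acts on a latent that is up to `refresh_every - 1` steps old, in exchange for running the expensive model far less often.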

Weaknesses

- GE-Base and GE-Act are designed for a specific robot configuration (one head camera and two wrist cameras). This weakens the potential of this pipeline to be leveraged in the general robotics community.
- The two-stage VLA training pipeline, especially the specific task fine-tuning in the second stage, does not showcase the generalizability of the proposed policy. Is there a large performance drop when tested on out-of-distribution tasks?
- I am missing a comparison between single-view and mul

Code & Models

Models

🤗 agibot-world/Genie-Envisioner (model · 3 downloads · 8 likes)