Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, Guanghui Ren

TL;DR
Genie Envisioner is a comprehensive platform for robotic manipulation that combines a video diffusion model, a policy decoder, a neural simulator, and a benchmark suite to enable scalable, instruction-driven embodied intelligence.
Contribution
It introduces a unified framework integrating policy learning, evaluation, and simulation in a single video-generative platform for robotic manipulation.
Findings
High-fidelity, instruction-conditioned video generation of robotic interactions
Generalizable policy inference across diverse robotic embodiments
A scalable neural simulator for closed-loop policy development
Abstract
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized…
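To make the "flow-matching decoder" idea in the abstract concrete, here is a minimal sketch of how sampling from a flow-matching model generally works: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. This is not GE-Act's implementation. The names `toy_velocity` and `sample_actions` are hypothetical, and the velocity function is an analytic stand-in for a learned network; it assumes a straight-line path from noise to a target action vector, and the conditioning latent is treated as that target purely for illustration.

```python
import random


def toy_velocity(x, t, latent):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity at
    x_t is (x_1 - x_t) / (1 - t). Assumption: the latent *is* the target
    action x_1; a real decoder would predict the velocity from latent features.
    """
    return [(goal - xi) / (1.0 - t) for xi, goal in zip(x, latent)]


def sample_actions(latent, num_steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) toward an action (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in latent]  # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so no division by zero
        v = toy_velocity(x, t, latent)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # forward Euler step
    return x


if __name__ == "__main__":
    print(sample_actions([0.5, -0.2, 1.0], num_steps=10))
```

With this oracle velocity the Euler integration lands on the target up to floating-point error; with a learned network, the number of integration steps trades off latency against accuracy.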
Peer Reviews
Decision·ICLR 2026 Poster
The multi-stage training, especially with varying frequencies, is a clever way to make the learned video prediction more robust to different execution speeds. The FiLM-style injection of the latents into the policy architecture, rather than using only the last-layer output, is a nice way to ensure that multi-layer information is used more effectively for downstream policy learning. The experiments are thorough and of high quality. The paper is written clearly and u
- I did not fully understand what exactly the multi-view consistency is. Is it just the joint prediction of head, left and right video feeds? Or is it in the structure of the attention layers used in the model?
- The approach requires few-shot fine-tuning for any new robotic embodiment. While this is a problem in general with VLA-style policy networks, I would imagine that the video generation model, if trained on a variety of robotic embodiments, such as from Open X-Embodiment or similar datasets
1. The paper demonstrates strong engineering efforts, including large-scale data and model training.
2. The experimental performance is strong.
1. The macro-level architecture design (an action model built on top of world model representations) was, to my knowledge, first introduced in the GR-1 and GR-2 series. However, these prior works are not cited or discussed, which weakens the contextual positioning of Genie Envisioner within the literature.
2. The paper’s presentation resembles a technical report more than a polished conference paper. Although it comprehensively covers implementation aspects, it fails to emphasize the key innovatio
- Well-presented paper that provides comprehensive details and well-illustrated figures, enabling readers to fully understand the task.
- GE-Base supports multi-view observation, which is quite helpful for robot planning tasks.
- I like the idea of asynchronous inference, which provides a practical solution for incorporating the world model and policy.
- The video results are impressive, and the extensive experiments prove the effectiveness of GE.
- GE-Base and GE-Act are designed for a specific robot configuration (one head camera and two wrist cameras). This weakens the potential of this pipeline to be leveraged in the general robotics community.
- The two-stage VLA training pipeline, especially the specific task fine-tuning in the second stage, does not showcase the generalizability of the proposed policy. Is there a large performance drop when tested on out-of-distribution tasks?
- I am missing a comparison between single-view and mul
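One of the reviews above refers to a "FiLM style injection" of latents into the policy. As a rough illustration of what FiLM (feature-wise linear modulation) means in general, the sketch below scales and shifts each hidden feature by values computed from a conditioning vector. It is not taken from the paper: the function names, shapes, and weights are hypothetical, and how GE-Act actually wires its conditioning is not specified in this page.

```python
def linear(weights, bias, x):
    """Plain affine map: one output per (row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: out_i = gamma_i * hidden_i + beta_i.

    gamma and beta are predicted from the conditioning vector `cond`
    (assumed here to be world-model latent features for one layer).
    """
    gamma = linear(w_gamma, b_gamma, cond)  # per-feature scale
    beta = linear(w_beta, b_beta, cond)     # per-feature shift
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


if __name__ == "__main__":
    hidden = [1.0, 2.0]            # policy features at one layer (toy values)
    cond = [1.0, 0.0, -1.0]        # conditioning latent (toy values)
    w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    b_gamma = [0.0, 0.0]
    w_beta = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    b_beta = [0.5, 0.25]
    print(film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta))
```

Applying a separate `film` call at several policy layers, each with latent features taken from a different depth of the conditioning model, is one way to read the reviewer's point about using multi-layer information rather than only the last-layer output.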
