Xiaomi EV World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving

Lijun Zhou; Hongcheng Luo; Zhenxin Zhu; Cheng Chi; Mingfei Tu; Kaixin Xiong; Lei Gong; Zhanqian Wu; Zehan Zhang; Fangzhen Li; Hao Li; Yingying Shen; Jiale He; Haohui Zhu; Shan Zhao; Kai Wang; Zhiwei Zhan; Yuechuan Pu; Kaiyuan Tan; Ruiling Yang; Xianqi Wang; Tianyi Yan; Jiawei Zhou; Lei Zhang; Jingyang Zhao; Xi Zhou; Chitian Sun; Chenming Wu; Jiong Deng; Hongwei Xie; Ming Lu; Kun Ma; Long Chen; Guang Chen; Hangjun Ye; Bing Wang; Haiyang Sun

arXiv:2605.18137·cs.CV·May 20, 2026

TL;DR

This paper introduces a unified world model for autonomous driving that combines a high-fidelity 3D scene representation with a causal video generation framework, enhancing simulation and data synthesis capabilities.

Contribution

The paper presents JWM, a joint system that couples a world reconstruction module (WorldRec) with a world generation module (WorldGen) to improve the stability and fidelity of autonomous driving simulation.

Findings

01. High-quality online causal video generation in as few as 4 denoising steps

02. Synergistic integration improves generation stability and cross-frame consistency

03. Provides a foundation for closed-loop simulation and data synthesis in autonomous driving

Abstract

This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross-view, cross-temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the JWM,…
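As a rough illustration of the query-driven aggregation the abstract describes, the sketch below projects a set of 3D query points into several camera views and averages the image features found at the projected pixels. It is a toy under stated assumptions, not WorldRec itself: the abstract does not specify the aggregation operator, so nearest-pixel sampling with a plain average stands in for whatever learned mechanism the paper uses, and every name here (`project_points`, `aggregate_query_features`, the pinhole camera model, the feature map layout) is hypothetical.

```python
import numpy as np

def project_points(points_world, K, world_to_cam):
    """Project (N, 3) world points into one pinhole camera (assumed model).

    Returns pixel coordinates (N, 2) and a mask of points in front of the camera.
    """
    ones = np.ones((len(points_world), 1))
    homo = np.concatenate([points_world, ones], axis=1)      # (N, 4)
    cam = (world_to_cam @ homo.T).T[:, :3]                   # (N, 3) camera frame
    in_front = cam[:, 2] > 1e-6
    depth = np.where(in_front, cam[:, 2], 1.0)               # avoid dividing by <= 0
    pix = (K @ cam.T).T                                      # (N, 3)
    uv = pix[:, :2] / depth[:, None]
    return uv, in_front

def aggregate_query_features(queries, feature_maps, intrinsics, extrinsics):
    """Average nearest-pixel features over every view in which a query is visible.

    queries:      (N, 3) query positions in world coordinates
    feature_maps: list of (H, W, C) arrays, one per view (or per view and timestep)
    intrinsics:   list of (3, 3) camera matrices
    extrinsics:   list of (4, 4) world-to-camera transforms
    Returns (N, C) aggregated features and (N,) visibility counts.
    """
    n, c = len(queries), feature_maps[0].shape[-1]
    acc = np.zeros((n, c))
    count = np.zeros(n)
    for fmap, K, T in zip(feature_maps, intrinsics, extrinsics):
        h, w, _ = fmap.shape
        uv, in_front = project_points(queries, K, T)
        u = np.rint(uv[:, 0]).astype(int)
        v = np.rint(uv[:, 1]).astype(int)
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        acc[visible] += fmap[v[visible], u[visible]]
        count[visible] += 1
    # Queries seen by no view keep a zero feature vector.
    return acc / np.maximum(count, 1)[:, None], count
```

Because each query has a fixed position in 3D, the same query gathers evidence from every camera and timestep that observes it, which is the intuition behind the abstract's claim of spatial consistency across frames. A real system would presumably replace the average with a learned operator and decode each query into Gaussian parameters; neither step is shown here.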
