X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving

Chaoda Zheng; Sean Li; Jinhao Deng; Zhennan Wang; Shijia Chen; Liqiang Xiao; Ziheng Chi; Hongbin Lin; Kangjie Chen; Boyang Wang; Yu Zhang; Xianming Liu

arXiv:2603.19979 · cs.CV · April 1, 2026

TL;DR

X-World is a controllable multi-camera generative world model for autonomous driving that produces realistic, long-term, multi-view video simulations with scene and appearance controls, enabling scalable evaluation.

Contribution

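The paper introduces X-World, an action-conditioned multi-camera generative world model that simulates future driving observations directly in video space. It supports scene and appearance control and scene editing, with the goal of making evaluation of end-to-end driving policies scalable and reproducible.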

Findings

01. Achieves high-quality multi-view video generation with strong view consistency.

02. Maintains stable and coherent long-term scene dynamics.

03. Supports flexible scene and appearance controls, including traffic and weather.

Abstract

Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision-language-action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene…
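The input/output contract described in the abstract (multi-view camera history plus a future action sequence in, future multi-view frames out) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`placeholder_world_model`, `rollout`, `context_len`, the `(speed, yaw_rate)` action format) is hypothetical, the step-by-step rollout is an assumption the abstract excerpt does not confirm, and the "model" is a trivial stand-in that copies the latest frame rather than a learned video generator.

```python
from typing import Callable, List, Sequence, Tuple

# All names below are hypothetical illustrations, not taken from the paper.
Frame = List[List[float]]        # toy stand-in for one camera image (H x W)
MultiViewFrame = List[Frame]     # one synchronized frame per camera
Action = Tuple[float, float]     # assumed (speed, yaw_rate) command
Model = Callable[[Sequence[MultiViewFrame], Action], MultiViewFrame]


def placeholder_world_model(
    history: Sequence[MultiViewFrame], action: Action
) -> MultiViewFrame:
    """Stand-in for a learned generator.

    Copies the most recent multi-view frame and adds the commanded speed to
    every pixel, only so that the dependence on the action is visible.
    """
    speed, _yaw_rate = action
    latest = history[-1]
    return [[[px + speed for px in row] for row in cam] for cam in latest]


def rollout(
    model: Model,
    history: Sequence[MultiViewFrame],
    actions: Sequence[Action],
    context_len: int = 4,
) -> List[MultiViewFrame]:
    """Generate one future multi-view frame per action, feeding each
    generated frame back in as context for the next step (assumed design)."""
    frames = list(history)
    generated: List[MultiViewFrame] = []
    for action in actions:
        next_frame = model(frames[-context_len:], action)
        frames.append(next_frame)
        generated.append(next_frame)
    return generated


# Toy usage: 3 cameras, 2x2 "images", 4 history steps, 2 future actions.
num_cams, height, width = 3, 2, 2
history = [
    [[[0.0] * width for _ in range(height)] for _ in range(num_cams)]
    for _ in range(4)
]
actions = [(1.0, 0.0), (2.0, 0.1)]
future = rollout(placeholder_world_model, history, actions)
```

In a closed-loop evaluation of the kind the abstract motivates, the action sequence would come from the driving policy under test, which would in turn observe the generated frames; that loop is omitted here.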
