ORV: 4D Occupancy-centric Robot Video Generation
Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Zheng Zhu, Xin Jin, Hang Zhao, Hao Zhao

TL;DR
ORV introduces a 4D occupancy-centric framework for robot video generation that enhances fidelity, temporal consistency, and controllability by integrating action priors and occupancy-based visual guidance, supported by a new large-scale dataset.
Contribution
The paper presents ORV, a novel 4D occupancy-based approach for robot video synthesis that improves realism and control, and curates a large-scale 4D occupancy dataset for embodied scenarios.
Findings
Achieves 18.8% lower FVD than state-of-the-art methods.
Improves success rate by 3.5% on visual planning tasks.
Enhances success rate by 6.4% on policy learning.
Abstract
Recent embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to single-view settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D occupancy-centric framework for robot video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for…
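To make the action-conditioning idea in the abstract more concrete, here is a minimal sketch of generic adaptive layer-norm (AdaLN) modulation applied to chunked 7-DoF actions. It is not the paper's Action-Expert module: the function names (`chunk_actions`, `adaln_modulate`), the single linear projection, the tensor shapes, and the zero initialization are all assumptions chosen for illustration, in the style commonly used by diffusion transformers.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis, without learned affine parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def chunk_actions(actions, chunk_size):
    # Hypothetical helper: group per-step 7-DoF actions so that each latent
    # frame receives one flattened chunk.
    # (T * chunk_size, 7) -> (T, chunk_size * 7)
    num_steps, dof = actions.shape
    assert num_steps % chunk_size == 0, "steps must divide evenly into chunks"
    return actions.reshape(num_steps // chunk_size, chunk_size * dof)

def adaln_modulate(latents, action_chunks, w, b):
    # latents: (T, N, D) video latent tokens, N tokens per latent frame.
    # action_chunks: (T, C); w: (C, 2 * D); b: (2 * D,).
    # Assumption: one linear layer maps each chunk to a per-frame scale and shift.
    scale_shift = action_chunks @ w + b               # (T, 2 * D)
    scale, shift = np.split(scale_shift, 2, axis=-1)  # each (T, D)
    # Broadcast the per-frame modulation over all tokens of that frame.
    return layer_norm(latents) * (1.0 + scale[:, None, :]) + shift[:, None, :]

rng = np.random.default_rng(0)
T, N, D, chunk_size = 4, 16, 32, 4
actions = rng.normal(size=(T * chunk_size, 7))  # 7-DoF action per control step
latents = rng.normal(size=(T, N, D))
chunks = chunk_actions(actions, chunk_size)

# Zero-initialized projection: the modulation starts as a plain layer norm.
w = np.zeros((chunks.shape[1], 2 * D))
b = np.zeros(2 * D)
out = adaln_modulate(latents, chunks, w, b)
```

With the zero-initialized projection the output equals the plain layer-normalized latents, so action conditioning would only take effect once `w` and `b` are trained. How the paper's module actually embeds and injects actions is not specified in the text above.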
Taxonomy
Topics: Robotics and Sensor-Based Localization · Human Motion and Animation · Robotic Path Planning Algorithms
