MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

Jiaxu Wang; Yicheng Jiang; Tianlun He; Jingkai Sun; Qiang Zhang; Junhao He; Jiahang Cao; Zesen Gan; Mingyuan Sun; Qiming Shao; Xiangyu Yue

arXiv:2602.09878·cs.CV·February 11, 2026

TL;DR

This paper introduces MVISTA-4D, a 4D world model for robotic manipulation. Given a single-view RGBD observation, it generates view-consistent RGBD sequences for the remaining viewpoints and infers actions through test-time optimization, with the aim of improving scene understanding and manipulation accuracy.

Contribution

It presents a new embodied 4D world model with view-consistent RGBD generation and a test-time action inference method, advancing scene prediction and robotic control.

Findings

01. Strong performance on 4D scene generation tasks
02. Effective action inference via test-time optimization
03. Improved manipulation accuracy across datasets

Abstract

World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, which limits their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To learn multi-view, cross-modality generation efficiently, we explicitly design cross-view and cross-modality feature fusion that jointly encourages consistency between RGB and depth and enforces geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled…
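The back-project-and-fuse step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a standard pinhole camera model, and the function names `backproject` and `fuse_views`, the intrinsics matrix `K`, and the 4x4 `cam_to_world` poses are hypothetical names introduced here for illustration. Fusion is shown as plain concatenation of per-view point clouds; a real pipeline would likely filter or merge overlapping points.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to world-frame 3D points (assumed pinhole model)."""
    h, w = depth.shape
    # u indexes columns, v indexes rows; both arrays have shape (H, W)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0  # treat non-positive depth as missing
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=1)[valid]
    # Rigid transform from camera frame to world frame
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

def fuse_views(depths, intrinsics, poses):
    """Concatenate the back-projected points of every view into one cloud."""
    clouds = [backproject(d, K, T) for d, K, T in zip(depths, intrinsics, poses)]
    return np.concatenate(clouds, axis=0)

# Toy example: two 2x2 depth maps, the second camera shifted 1 unit along x
K = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[0, 3] = 1.0
depth = np.ones((2, 2))
cloud = fuse_views([depth, depth], [K, K], [pose_a, pose_b])
```

Applying the same routine at every predicted time step would give one fused point cloud per frame, which is the sense in which multi-view RGBD generation yields 3D structure across time.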

Taxonomy

Topics: Robot Manipulation and Learning · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis