Geometry-aware 4D Video Generation for Robot Manipulation
Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

TL;DR
This paper introduces a geometry-aware 4D video generation model that enforces multi-view 3D consistency, enabling robots to predict and utilize future spatio-temporal scenes for improved manipulation in complex environments.
Contribution
The paper proposes a novel 4D video generation approach that enforces multi-view geometric consistency, allowing for view-independent scene prediction without camera pose inputs.
Findings
Generated videos are more visually stable and spatially aligned across views.
The method enables recovery of robot trajectories from predicted videos.
Robotic policies generalize well to new viewpoints using the generated videos.
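To make the trajectory-recovery finding concrete: the reviews note that the generated videos are passed to an off-the-shelf 6DoF pose tracker (FoundationPose) to obtain end-effector poses. The sketch below assumes such a tracker has already produced one 4x4 homogeneous pose per predicted frame, and only shows how consecutive poses could be turned into relative motions. The function names and the pose format are illustrative assumptions, not the authors' code or the tracker's API.

```python
import numpy as np


def make_translation(x, y, z):
    """Hypothetical helper: build a 4x4 pose with identity rotation."""
    pose = np.eye(4)
    pose[:3, 3] = [x, y, z]
    return pose


def relative_pose_deltas(poses):
    """Convert per-frame poses into frame-to-frame relative transforms.

    poses: sequence of T homogeneous 4x4 end-effector poses (assumed tracker output).
    Returns an array of shape (T - 1, 4, 4) where delta[i] = inv(poses[i]) @ poses[i + 1],
    i.e. the motion from frame i to frame i + 1 expressed in the frame-i gripper frame.
    """
    poses = np.asarray(poses, dtype=float)
    if poses.ndim != 3 or poses.shape[1:] != (4, 4) or len(poses) < 2:
        raise ValueError("expected at least two 4x4 poses")
    return np.stack(
        [np.linalg.inv(poses[i]) @ poses[i + 1] for i in range(len(poses) - 1)]
    )


# Usage: a gripper moving 0.1 m along x in each predicted frame.
tracked = [make_translation(0.1 * i, 0.0, 0.0) for i in range(4)]
deltas = relative_pose_deltas(tracked)
```

Relative transforms are used here only because they are a common action parameterization; the source does not state which representation the policy actually consumes.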
Abstract
Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually…
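The cross-view pointmap alignment described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera model, a known relative pose between two views at training time, and a mean Euclidean distance as the alignment loss. The intrinsics, function names, and the choice of loss are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def depth_to_pointmap(depth, fx, fy, cx, cy):
    """Unproject an (H, W) depth image into an (H, W, 3) pointmap in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def transform_pointmap(pointmap, transform):
    """Apply a 4x4 rigid transform to every 3D point of an (H, W, 3) pointmap."""
    rotation, translation = transform[:3, :3], transform[:3, 3]
    return pointmap @ rotation.T + translation


def pointmap_alignment_loss(pred, target, valid=None):
    """Mean Euclidean distance between predicted and target pointmaps over valid pixels."""
    err = np.linalg.norm(pred - target, axis=-1)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    return float(err[valid].mean())


# Usage: the supervision target for a second view is its ground-truth pointmap
# expressed in the reference view's frame; a model prediction would take the place of `pred`.
depth_b = np.full((4, 4), 2.0)
T_ref_from_b = np.eye(4)
T_ref_from_b[0, 3] = 0.5  # assumed relative pose, available only during training
target = transform_pointmap(depth_to_pointmap(depth_b, 100.0, 100.0, 2.0, 2.0), T_ref_from_b)
pred = target + 0.01  # stand-in for a slightly misaligned network output
loss = pointmap_alignment_loss(pred, target)
```

Because every view's points are expressed in one shared reference frame, the relative pose is needed only to build the training target, which is consistent with the abstract's claim that camera poses are not required as input.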
Peer Reviews
Decision: ICLR 2026 Poster
I found this to be a well-written paper, with a clear formulation and convincing empirical results. There's reasonably thorough evaluation across stages: for both 4D generation and for learned robot policies. I think the robot learning community would benefit from these results; it's a nice existence proof for one way video/world models might be useful for robot learning.
While the baselines are reasonable, they do feel slightly "set up to fail". For video generation the baselines are RGB only while the proposed method is trained on RGB-D; same for policy rollout. The core message of the paper, however, is in part that the geometry component is critical so perhaps this is fine. While the approach is elegant and the results are strong, the system depends heavily on multi-view RGB-D data and an external pose tracker (FoundationPose). This raises questions about sc
1. Originality: The integration of cross-view pointmap alignment into a video diffusion model for 4D generation is novel. Unlike prior 4D methods that assume known camera poses or operate on static scenes, this work handles dynamic, multi-object manipulation without pose inputs at test time.
2. Quality: Experiments are thorough, covering both simulation (LBM) and real-world domains, with ablations, multiple baselines (SVD variants, 4D Gaussian), and downstream policy evaluation.
3. Clarity: The
1. Computational Cost: Inference takes ~30 seconds per 10-frame rollout (Table 3), which limits real-time or closed-loop use. While acknowledged in §5, more discussion on latency-accuracy trade-offs or potential optimizations (e.g., sparse prediction, distillation) would strengthen practical impact.
2. Real-World Data Scale: The real-world dataset includes only 20 demos per task. While fine-tuning from simulation helps, it’s unclear how performance scales with more diverse real data or more com
1. The method achieves geometric consistency across multiple views in the generated videos through cross-view geometric supervision and a cross-attention mechanism.
2. The model shows a significant improvement in task success rate for manipulation tasks compared to baseline methods.
3. The generated 4D videos can be directly combined with an off-the-shelf 6DoF pose tracker to extract robot end-effector trajectories.
1. The authors need to fully explain the difference between their proposed method and the joint spatio-temporal consistency optimization methods mentioned in references [3, 4]. The authors should specifically show how the proposed approach is better suited for the multi-object, dynamic robot manipulation scenes.
2. In Table 1, the $FVD-{n}$ scores for Task 2 and Task 4 are not significantly different from the SVD finetuning baseline. More results from additional tasks are needed to verify the ad
Taxonomy
Topics: Robot Manipulation and Learning · Advanced Vision and Imaging · Human Pose and Action Recognition
