MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Tao Hu, Honglei Yan, Katerina Fragkiadaki, Yadong Mu

TL;DR
MoVieS is a fast, unified model that reconstructs 4D dynamic scenes from monocular videos, enabling view synthesis, geometry, and motion understanding in one second.
Contribution
It introduces a motion-aware, pixel-aligned Gaussian primitive representation for 4D scene reconstruction from monocular videos, unifying appearance, geometry, and motion modeling.
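As a rough illustration of what "pixel-aligned" with explicit motion could mean, the sketch below lifts every pixel of a depth map to a 3D point and then shifts it by a per-pixel displacement for a query time. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `unproject_pixels` and `dynamic_centres`, the pinhole intrinsics `K`, the camera-to-world matrix, the half-pixel centre convention, and the additive displacement model are all hypothetical choices made for illustration.

```python
import numpy as np

def unproject_pixels(depth, K, cam_to_world):
    """Lift every pixel of an (H, W) depth map to a 3D world-space point.

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4 camera-to-world
    matrix; pixel centres are taken at (u + 0.5, v + 0.5) by assumption.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # camera-space rays with z = 1
    pts_cam = rays * depth[..., None]                  # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                           # (H, W, 3) world-space points

def dynamic_centres(depth, K, cam_to_world, displacement):
    """Hypothetical motion model: centre at query time = static anchor + displacement."""
    return unproject_pixels(depth, K, cam_to_world) + displacement

# Usage: a 2x2 depth map at constant depth 2, identity camera pose.
depth = np.full((2, 2), 2.0)
K = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
anchors = unproject_pixels(depth, K, np.eye(4))
moved = dynamic_centres(depth, K, np.eye(4), np.ones((2, 2, 3)))
```

Each pixel yields exactly one primitive, so the output keeps the image layout `(H, W, 3)`; a full Gaussian would also carry scale, rotation, opacity, and color, which are omitted here.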
Findings
Reconstructs dynamic scenes and synthesizes novel views within one second.
Supports zero-shot applications like scene flow estimation and object segmentation.
Demonstrates competitive performance with significant speed advantages.
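The page does not say how the zero-shot applications are derived. One plausible reading, sketched below purely as an assumption, is that if a model outputs a per-pixel 3D displacement for each query time, scene flow is the difference between two such displacement fields and a moving-object mask is a threshold on its magnitude. The names `scene_flow` and `moving_mask` and the threshold value are hypothetical.

```python
import numpy as np

def scene_flow(disp_t0, disp_t1):
    """Per-pixel 3D flow between two query times, given (H, W, 3) displacement fields."""
    return disp_t1 - disp_t0

def moving_mask(flow, threshold=0.05):
    """Mark a pixel as moving when its flow magnitude exceeds a hand-picked threshold."""
    return np.linalg.norm(flow, axis=-1) > threshold

# Usage: a static 2x2 scene in which only one pixel moves by 1 unit along x.
d0 = np.zeros((2, 2, 3))
d1 = np.zeros((2, 2, 3))
d1[0, 1, 0] = 1.0
mask = moving_mask(scene_flow(d0, d1))
```

The threshold is scene-scale dependent, so in practice it would need tuning or normalization; the value here is only a placeholder.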
Abstract
We present MoVieS, a Motion-aware View Synthesis model that reconstructs 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel-aligned Gaussian primitives and explicitly supervises their time-varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning-based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Efficient and Unified Approach: MoVieS integrates appearance, geometry, and motion using 3D Gaussian Splatting (3DGS), enabling fast inference while achieving competitive performance across multiple benchmarks such as RealEstate10K and TAPVid-3D.
2. Zero-Shot Capabilities: The model demonstrates strong zero-shot performance in tasks such as scene flow estimation and moving object segmentation, showcasing its versatility and potential for real-world applications.
1. Artifacts in Video Results: The video results show artifacts, such as blurred legs and embedded wheels, impacting visual coherence. The authors should provide a detailed discussion on the causes (e.g., model limitations or data issues) to better understand the model's performance and limitations.
2. Camera Pose Prediction: The removal of camera pose prediction, despite VGGT’s ability to estimate it, is unclear. Since the model still requires camera pose estimation, the authors should explain
* **Easy to follow** The paper is easy to follow, and the technical details are easy to understand.
* **Good ablation study** Figure 4 and Table 6 in particular show an ablation study on the loss function. It validates the effectiveness of the proposed ideas on motion loss, NVS loss, L1 loss, and distribution loss.
* **Good empirical results on benchmark datasets** In Table 2, the method shows better accuracy than other feed-forward NVS methods (e.g., DepthSplat, GS-LRM) or optimization-ba
* **Possible unfair comparison in Table 3** The other methods (BootsTAPIR, CoTracker3, and SpatialTracker) possibly do not use camera pose input at inference. Given that the proposed method uses precomputed (or given) camera poses, I wonder how fair the comparison is. It would be helpful to know i) how the evaluation is actually conducted when the other methods have no pose information, and ii) which pose estimator (or GT pose?) the proposed method uses.
* **Unclear motion color co
- To my knowledge, this is the first work to address dynamic reconstruction in a feed-forward manner (at least to satisfactory quality).
- The dynamic splatter representation is intuitive and theoretically sound.
- The work makes good use of a strong VGGT prior.
- The training regime with complementary supervision from multiple datasets is an interesting way of making use of incomplete data.
- The qualitative results look visually impressive.
- Similarly, the quantitative evaluation shows good p
- With high computational cost, the adoption of the framework most likely depends on the code and pretrained model release.
- The model is initialised with a VGGT backbone, which raises the question of how much of the performance comes from VGGT pretraining (would be a useful ablation).
- It seems that training is very sensitive and needs a heavily engineered curriculum.
- It would be good to perform a deeper analysis on motion supervision, e.g. 2000x oversampling on Spring seems quite strong.
-
**[S1]** The model introduces minimal architectural modifications to VGGT while achieving impressive performance. By naturally extending VGGT, it demonstrates strong **novel view synthesis (NVS)** results across various datasets. The **depth** and **splat heads** directly predict **3D Gaussian Splatting (3D-GS)** representations, enabling **real-time multi-view rendering**. The **motion head** predicts **time-conditioned motion fields**, effectively modeling **dynamic splatter pixels**—a novel a
[W1] **Assumption of pose-awareness**: their strong assumption of pose-awareness within the underlying scene severely limits the applicability of the model in real-world scenarios. Given that VGGT jointly estimates camera parameters along with other dense outputs such as depth, pointmaps, and tracking features, I am curious why the authors decided to exclude camera tokens from their model architecture. Since MoVieS is primarily trained on datasets where camera parameters are available (e.g., Rea
Taxonomy
Topics: Advanced Vision and Imaging · Human Motion and Animation · Computer Graphics and Visualization Techniques
