MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

Chenguo Lin; Yuchen Lin; Panwang Pan; Yifan Yu; Tao Hu; Honglei Yan; Katerina Fragkiadaki; Yadong Mu

arXiv:2507.10065 · cs.CV · February 24, 2026

TL;DR

MoVieS is a fast, unified model that reconstructs 4D dynamic scenes from monocular videos, enabling view synthesis, geometry, and motion understanding in one second.

Contribution

It introduces a motion-aware, pixel-aligned Gaussian primitive representation for 4D scene reconstruction from monocular videos, unifying appearance, geometry, and motion modeling.

Findings

1. Achieves real-time reconstruction and view synthesis within one second.
2. Supports zero-shot applications like scene flow estimation and object segmentation.
3. Demonstrates competitive performance with significant speed advantages.
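As a rough illustration of the zero-shot applications listed above, the sketch below derives scene flow and a moving-object mask from per-pixel 3D displacement maps. It assumes a model has already produced one displacement per pixel for each query time. The function names and the threshold value are hypothetical, and this page does not state how MoVieS itself performs these steps.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Field = List[List[Vec3]]  # one 3D vector per pixel, indexed [row][col]


def scene_flow(disp_t0: Field, disp_t1: Field) -> Field:
    """Per-pixel 3D flow between two query times (difference of displacements)."""
    return [
        [tuple(b - a for a, b in zip(p0, p1)) for p0, p1 in zip(row0, row1)]
        for row0, row1 in zip(disp_t0, disp_t1)
    ]


def moving_mask(displacement: Field, threshold: float = 0.05) -> List[List[bool]]:
    """Mark a pixel as moving when its displacement norm exceeds the threshold.

    The threshold is an assumed, scene-scale-dependent value.
    """
    return [
        [math.sqrt(sum(x * x for x in p)) > threshold for p in row]
        for row in displacement
    ]


# Toy 2x2 example: only the top-right pixel moves between the two times.
still = [[(0.0, 0.0, 0.0)] * 2 for _ in range(2)]
moved = [[(0.0, 0.0, 0.0), (0.3, 0.4, 0.0)], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
flow = scene_flow(still, moved)
mask = moving_mask(flow)
```

A fixed threshold is the simplest possible choice; in practice it would have to be tuned to the metric scale of the reconstructed scene.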

Abstract

We present MoVieS, a Motion-aware View Synthesis model that reconstructs 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel-aligned Gaussian primitives and explicitly supervises their time-varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning-based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive…
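To make the representation described in the abstract more concrete, here is a minimal sketch of pixel-aligned primitives with time-varying motion. Each pixel is lifted to a 3D center using its depth, and a per-pixel displacement then moves that center to a query time. This is an illustration under stated assumptions, not the authors' implementation: the function names (`unproject_pixels`, `centers_at_time`), the pinhole parameters (`fx`, `fy`, `cx`, `cy`), and the additive displacement model are hypothetical choices made for the example.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Field = List[List[Vec3]]  # one 3D vector per pixel, indexed [row][col]


def unproject_pixels(depth: List[List[float]],
                     fx: float, fy: float, cx: float, cy: float) -> Field:
    """Lift every pixel to a 3D point in the camera frame.

    Assumes a pinhole camera; each returned point would serve as the
    center of one pixel-aligned primitive.
    """
    centers = []
    for row, depth_row in enumerate(depth):
        out_row = []
        for col, d in enumerate(depth_row):
            u, v = col + 0.5, row + 0.5  # sample at the pixel center
            out_row.append(((u - cx) / fx * d, (v - cy) / fy * d, d))
        centers.append(out_row)
    return centers


def centers_at_time(centers: Field, displacement: Field) -> Field:
    """Move each center by its per-pixel displacement for one query time."""
    return [
        [tuple(c + m for c, m in zip(center, move))
         for center, move in zip(c_row, m_row)]
        for c_row, m_row in zip(centers, displacement)
    ]


# Toy 2x2 frame at constant depth; only the bottom-right pixel moves.
depth = [[3.0, 3.0], [3.0, 3.0]]
static = unproject_pixels(depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
motion = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
          [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0)]]
moved = centers_at_time(static, motion)
```

The sketch covers only the primitive centers. A full Gaussian primitive would also carry attributes such as scale, rotation, opacity, and color, which are omitted here.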

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Efficient and Unified Approach: MoVieS integrates appearance, geometry, and motion using 3D Gaussian Splatting (3DGS), enabling fast inference while achieving competitive performance across multiple benchmarks such as RealEstate10K and TAPVid-3D.
2. Zero-Shot Capabilities: The model demonstrates strong zero-shot performance in tasks such as scene flow estimation and moving object segmentation, showcasing its versatility and potential for real-world applications.

Weaknesses

1. Artifacts in Video Results: The video results show artifacts, such as blurred legs and embedded wheels, impacting visual coherence. The authors should provide a detailed discussion on the causes (e.g., model limitations or data issues) to better understand the model's performance and limitations.
2. Camera Pose Prediction: The removal of camera pose prediction, despite VGGT’s ability to estimate it, is unclear. Since the model still requires camera pose estimation, the authors should explain

Reviewer 02 · Rating 4 · Confidence 5

Strengths

* **Easy to follow** The paper is easy to follow. It is easy to understand the technical details.
* **Good ablation study** Figure 4 and Table 6 in particular show an ablation study on the loss function. It validates the effectiveness of the proposed ideas on motion loss, NVS loss, L1 loss, and distribution loss.
* **Good empirical results on benchmark datasets** In Table 2, the method shows better accuracy than other feed-forward NVS methods (e.g., DepthSplat, GS-LRM) or optimization-ba

Weaknesses

* **Possible unfair comparison in Table 3** The other methods (BootsTAPIR, CoTracker3, and SpatialTracker) possibly do not use camera pose input at inference. Given that the proposed method uses precomputed (or given) camera poses, I wonder how fair the comparison is. I would be curious to know i) how the evaluation is actually conducted when the other methods have no pose information, and ii) which pose estimator (or GT pose?) the proposed method uses.
* **Unclear motion color co

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- To my knowledge, this is the first work to address dynamic reconstruction in a feed-forward manner (at least to satisfactory quality).
- The dynamic splatter representation is intuitive and theoretically sound.
- The work makes good use of a strong VGGT prior.
- The training regime with complementary supervision from multiple datasets is an interesting way of making use of incomplete data.
- The qualitative results look visually impressive.
- Similarly, the quantitative evaluation shows good p

Weaknesses

- With high computational cost, the adoption of the framework most likely depends on the code and pretrained model release.
- The model is initialised with a VGGT backbone, which raises the question of how much of the performance comes from VGGT pretraining (would be a useful ablation).
- It seems that training is very sensitive and needs a heavily engineered curriculum.
- It would be good to perform a deeper analysis on motion supervision, e.g. 2000x oversampling on Spring seems quite strong.
-

Reviewer 04 · Rating 2 · Confidence 5

Strengths

**[S1]** The model introduces minimal architectural modifications to VGGT while achieving impressive performance. By naturally extending VGGT, it demonstrates strong **novel view synthesis (NVS)** results across various datasets. The **depth** and **splat heads** directly predict **3D Gaussian Splatting (3D-GS)** representations, enabling **real-time multi-view rendering**. The **motion head** predicts **time-conditioned motion fields**, effectively modeling **dynamic splatter pixels**—a novel a

Weaknesses

[W1] **Assumption of pose-awareness**: their strong assumption of pose-awareness within the underlying scene severely limits the applicability of the model in real-world scenarios. Given that VGGT jointly estimates camera parameters along with other dense outputs such as depth, pointmaps, and tracking features, I am curious why the authors decided to exclude camera tokens from their model architecture. Since MoVieS is primarily trained on datasets where camera parameters are available (e.g., Rea

Code & Models

Models

🤗 chenguolin/MoVieS · model · ♡ 1

Taxonomy

Topics: Advanced Vision and Imaging · Human Motion and Animation · Computer Graphics and Visualization Techniques