MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos with Depth Priors
Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lyv, Peng Wang, Wenping Wang, Junhui Hou

TL;DR
MoDGS introduces a novel pipeline that leverages depth priors and a 3D-aware initialization to enable high-quality novel view synthesis of dynamic scenes from casually captured monocular videos, overcoming previous camera motion limitations.
Contribution
The paper presents MoDGS, a new method that effectively reconstructs dynamic scenes from static or slowly moving camera videos using depth guidance and innovative initialization techniques.
Findings
Outperforms state-of-the-art methods significantly.
Capable of rendering high-quality views from casual monocular videos.
Effective in handling static or slow camera movements.
Abstract
In this paper, we propose MoDGS, a new pipeline to render novel views of dynamic scenes from a casually captured monocular video. Previous monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid movement of input cameras to construct multiview consistency but struggle to reconstruct dynamic scenes on casually captured input videos whose cameras are either static or move slowly. To address this challenging task, MoDGS adopts recent single-view depth estimation methods to guide the learning of the dynamic scene. Then, a novel 3D-aware initialization method is proposed to learn a reasonable deformation field and a new robust depth loss is proposed to guide the learning of dynamic scene geometry. Comprehensive experiments demonstrate that MoDGS is able to render high-quality novel view images of dynamic scenes from just a casually captured monocular video, which…
Peer Reviews
Decision·ICLR 2025 Poster
In terms of novelty, this paper makes a distinct contribution by introducing depth supervision into the domain of dynamic Gaussian Splatting (DGS) for monocular dynamic input. This approach is novel yet intuitive, filling a key gap in the field for cases where the input consists of casually captured videos with minimal camera movement. Compared to the other papers in the field that mechanically put all fancy complicated input feature streams or loss functions together, the proposed solution is conce
MoDGS is validated across several datasets, which demonstrates its robustness. However, the paper could discuss the potential limitations in generalizing this approach to different depth estimation models. Doing so would further demonstrate the robustness and generalizability of the proposed method.
MoDGS represents an original approach within novel view synthesis and dynamic scene modeling by specifically addressing the limitations of existing methods for casually captured monocular videos. The authors introduce a 3D-aware initialization mechanism and an ordinal depth loss, which together offer a solution that successfully reduces the dependency on rapid camera motion. The novel use of ordinal depth loss to maintain depth order among frames, rather than relying solely on absolute values, represent
While the ordinal depth loss is a novel way to improve depth coherence, I believe the paper may benefit from more discussion on its limitations. Specifically, the ordinal depth loss assumes a consistent depth order among frames, which may not hold in scenes with complex occlusions or reflections. MoDGS assumes smooth transitions between frames for consistent depth ordering. However, the approach may face challenges in scenes with rapid or erratic movement where objects appear and disappear frequ
* A differentiable order-based loss function, the ordinal depth loss, is proposed, with detailed descriptions of its motivation and its distinctions from other depth loss functions.
* It demonstrates significant superiority over multi-view camera methods in reconstruction metrics and visual results, with ablation studies validating the importance of the "3D-aware initialization scheme" and "ordinal depth loss."
* The paper is well-written and easy to follow.
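
To make the idea of a differentiable order-based depth loss concrete, here is a minimal sketch. It assumes the loss compares the depth order of sampled pixel pairs and relaxes the sign function with `tanh`. The function names, the `alpha` sharpness parameter, and the flattened-list depth representation are illustrative assumptions, not the authors' implementation, and the paper's exact formulation may differ.

```python
import math
import random


def sample_pairs(num_pixels, count, seed=0):
    """Draw `count` random pairs of distinct pixel indices (hypothetical helper)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(num_pixels), 2)) for _ in range(count)]


def ordinal_depth_loss(rendered, prior, pairs, alpha=100.0):
    """Mean |tanh(alpha * (r_i - r_j)) - order(p_i, p_j)| over pixel pairs.

    `rendered` and `prior` are flattened depth maps (lists of floats).
    Only the ordering of the prior is used, so the loss is unaffected by
    an unknown positive scale or shift in the estimated depth.
    """
    total = 0.0
    for i, j in pairs:
        # Target order from the depth prior: +1 if pixel i is farther, else -1.
        target = 1.0 if prior[i] > prior[j] else -1.0
        # Smooth, differentiable stand-in for sign(rendered[i] - rendered[j]).
        soft_order = math.tanh(alpha * (rendered[i] - rendered[j]))
        total += abs(soft_order - target)
    return total / len(pairs)


prior = [1.0, 2.0, 3.0, 4.0]
pairs = sample_pairs(len(prior), count=16)

# Same ordering at a different scale: the loss stays near zero.
loss_same_order = ordinal_depth_loss([10.0, 20.0, 30.0, 40.0], prior, pairs)
# Reversed ordering: every pair is wrong, so the loss approaches its maximum of 2.
loss_reversed = ordinal_depth_loss([40.0, 30.0, 20.0, 10.0], prior, pairs)
```

Because only the order of the prior enters the target, a rendered depth map that differs from the prior by a positive scale or offset is not penalized, which is the property that distinguishes an ordinal loss from one on absolute depth values.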
* **The contributions and innovations are limited**. This work is based on the previous canonical space paradigm of 3D Gaussian Splatting (3DGS) combined with deformation fields, with the main contributions being a deformable 3DGS initialization method and a depth loss. The primary principle of the former relies on predicting per-pixel 3D flow using current state-of-the-art monocular depth estimation and optical flow estimation methods. However, the sole innovative aspect lies in converting 2D o
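
The principle the reviewer describes, obtaining per-pixel 3D flow from a monocular depth estimate and a 2D optical flow estimate, can be sketched as follows. This is only an illustration under stated assumptions: a pinhole camera with hypothetical intrinsics `(fx, fy, cx, cy)`, a static camera so both frames share one coordinate system, and depth values that are already consistent across frames. It is not the paper's actual initialization procedure.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with the given depth to a 3D camera-space point."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def lift_flow_to_3d(u, v, flow_uv, depth_t, depth_t1, intrinsics):
    """Turn a 2D flow vector at pixel (u, v) into a 3D displacement.

    depth_t  : depth at (u, v) in frame t
    depth_t1 : depth at the flow target (u + du, v + dv) in frame t + 1
    """
    fx, fy, cx, cy = intrinsics
    start = unproject(u, v, depth_t, fx, fy, cx, cy)
    # Follow the optical flow to the corresponding pixel in the next frame.
    u1, v1 = u + flow_uv[0], v + flow_uv[1]
    end = unproject(u1, v1, depth_t1, fx, fy, cx, cy)
    return tuple(b - a for a, b in zip(start, end))


# A point at the image center, 2 units away, moving 10 pixels to the right
# while keeping its depth.
intrinsics = (100.0, 100.0, 50.0, 50.0)  # hypothetical fx, fy, cx, cy
flow_3d = lift_flow_to_3d(50.0, 50.0, (10.0, 0.0), 2.0, 2.0, intrinsics)
```

In this example the 10-pixel shift at depth 2 with a focal length of 100 corresponds to a sideways displacement of 0.2 units and no change along the other two axes.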
Taxonomy
Topics: Advanced Optical Imaging Technologies · Infrared Target Detection Methodologies · Video Surveillance and Tracking Methods
