Seeing World Dynamics in a Nutshell
Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, Xinchao Wang

TL;DR
NutWorld is a novel framework that converts monocular videos into continuous 3D Gaussian representations, enabling efficient, high-fidelity, and real-time scene modeling without optimization.
Contribution
It introduces the structured spatial-temporal aligned Gaussian (STAG) representation and a method that transforms monocular videos into dynamic 3D Gaussian models in a single forward pass, improving coherence and efficiency.
Findings
High-fidelity video reconstruction achieved
Real-time scene modeling demonstrated
Effective depth and flow regularization implemented
Abstract
We consider the problem of efficiently representing casually captured monocular videos in a spatially- and temporally-coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques treating videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to the absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free…
Taxonomy
Topics: Historical Geography and Cartography
