Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting
Tingxuan Huang, Haowei Zhu, Jun-hai Yong, Hao Pan, Bin Wang

TL;DR
Mango-GS introduces a multi-frame, node-guided 4D Gaussian splatting framework that models motion dependencies with a temporal Transformer, achieving high-fidelity, temporally consistent reconstructions of dynamic scenes with real-time rendering.
Contribution
The paper proposes a novel multi-frame, node-guided approach with a temporal Transformer for stable, high-quality dynamic scene reconstruction, addressing overfitting and correspondence drift issues.
Findings
Achieves state-of-the-art reconstruction quality.
Enables real-time rendering of dynamic scenes.
Provides robust motion modeling with sparse control nodes.
Abstract
Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced…
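The abstract's two core ideas, attention across a short window of frames and motion carried by a sparse set of control nodes, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the identity query/key/value projections, the nearest-neighbour lookup in canonical space, the inverse-distance blending weights, and the function names `temporal_self_attention` and `blend_node_offsets` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(window):
    """Self-attention over the frames of ONE control node.

    window: list of T feature vectors, one per frame in the window.
    Returns T vectors; each is a softmax-weighted mix of all frames.
    Assumption: identity Q/K/V projections (a real model learns them).
    """
    d = len(window[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in window:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in window]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, window))
                    for i in range(d)])
    return out


def blend_node_offsets(gaussian_pos, node_pos, node_offsets, k=2):
    """Displace one Gaussian using its k nearest control nodes.

    Neighbours are looked up from the nodes' canonical positions, which
    do not change over time. Assumption: inverse-distance weights; the
    summary above does not state the paper's actual weighting scheme.
    """
    nearest = sorted((math.dist(gaussian_pos, p), i)
                     for i, p in enumerate(node_pos))[:k]
    weights = [1.0 / (dist + 1e-8) for dist, _ in nearest]
    total = sum(weights)
    return [sum(w * node_offsets[i][c] for w, (_, i) in zip(weights, nearest)) / total
            for c in range(3)]
```

Because neighbours are found from fixed canonical positions rather than from the deformed positions at each frame, a Gaussian keeps the same set of guiding nodes however far the scene moves. That is one plausible reading of how a decoupled canonical position helps avoid correspondence drift.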
Peer Reviews
Decision: ICLR 2026 Poster
* The decoupled node representation is a well-motivated design that elegantly addresses the neighborhood drift problem in large motion scenarios.
* The multi-frame temporal attention mechanism represents a significant departure from per-frame optimization strategies prevalent in prior work. This design enables the model to learn motion patterns rather than memorize instantaneous states, leading to improved temporal coherence as evidenced by both quantitative metrics and qualitative visualization
* The theoretical justification for why the decoupled representation prevents neighborhood drift is primarily empirical. While Figure 2 provides visual evidence, a more rigorous analysis of the learned feature space and how it maintains semantic consistency under large deformations would strengthen the claims.
* The motion-aware loss components ($L_{diff}, L_{dir}$) are mentioned but never formally defined in the main paper.
* The evaluation focuses heavily on PSNR/SSIM metrics, but temporal co
1. Decoupling the 4DGS makes sense.
2. This paper presents good FPS performance.
1. For the quantitative comparison in Table 1, the improvement in PSNR is limited.
2. The visual comparison is weak. It would be better to provide video comparisons in the supplementary material.
* The paper clearly identifies a core weakness in existing dynamic 3D Gaussian Splatting methods, namely their reliance on per-frame optimization, which causes temporal inconsistency and overfitting to instantaneous states. The motivation for introducing a multi-frame modeling framework is well justified and directly addresses this limitation.
* The proposed multi-frame temporal deformation network interleaves MLP layers with temporal self-attention blocks and a gated fusion mechanism. This hyb
* My main concern regarding the design of Mango-GS lies in the insufficient justification for using a Transformer to address temporal inconsistency. First, the advantage of a Transformer is its ability to capture long-range dependencies, but Table 2 shows that the optimal temporal window is only six frames. Is using a Transformer to model such a short sequence truly necessary, and does the computational cost justify the potential performance gain? Furthermore, several prior works have also explore
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
