Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting

Tingxuan Huang; Haowei Zhu; Jun-hai Yong; Hao Pan; Bin Wang

arXiv:2603.11543·cs.CV·March 13, 2026


Open Access · 3 Reviews

TL;DR

Mango-GS introduces a multi-frame, node-guided 4D Gaussian splatting framework that models motion dependencies with a temporal Transformer, achieving high-fidelity, temporally consistent dynamic scene reconstruction with real-time rendering.

Contribution

The paper proposes a novel multi-frame, node-guided approach with a temporal Transformer for stable, high-quality dynamic scene reconstruction, addressing overfitting and correspondence drift issues.

Findings

01. Achieves state-of-the-art reconstruction quality.

02. Enables real-time rendering of dynamic scenes.

03. Provides robust motion modeling with sparse control nodes.

Abstract

Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced…
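To make the idea in the abstract concrete, the following is a minimal sketch of temporal self-attention applied to a single control node over a short window of frames. It is not the authors' architecture: the single attention head, the token layout (latent code plus a scalar timestamp), the linear offset head, the random untrained weights, and every function and parameter name here are assumptions chosen for illustration. The window length of six frames follows the value a reviewer cites from Table 2.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def temporal_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention across the frames of one node."""
    q = [matvec(Wq, t) for t in tokens]
    k = [matvec(Wk, t) for t in tokens]
    v = [matvec(Wv, t) for t in tokens]
    d = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product score of this frame against every frame in the window.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out


def deform_node(canonical_pos, latent, times, params):
    """Return one deformed 3D position per frame in the window."""
    # Hypothetical token layout: the node's latent code plus a scalar timestamp.
    tokens = [latent + [t] for t in times]
    attended = temporal_attention(tokens, params["Wq"], params["Wk"], params["Wv"])
    # A linear head maps each attended feature to a 3D offset
    # that is added to the fixed canonical position.
    offsets = [matvec(params["Wo"], a) for a in attended]
    return [[p + o for p, o in zip(canonical_pos, off)] for off in offsets]


rng = random.Random(0)
latent_dim, hidden, window = 4, 8, 6
params = {
    "Wq": rand_matrix(hidden, latent_dim + 1, rng),
    "Wk": rand_matrix(hidden, latent_dim + 1, rng),
    "Wv": rand_matrix(hidden, latent_dim + 1, rng),
    "Wo": rand_matrix(3, hidden, rng),
}
canonical_pos = [0.0, 1.0, 2.0]
latent = [rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]
times = [i / (window - 1) for i in range(window)]
positions = deform_node(canonical_pos, latent, times, params)

for t, p in zip(times, positions):
    print(f"t={t:.1f} position={[round(c, 4) for c in p]}")
```

In this sketch the canonical position is never overwritten; only the per-frame offset changes. That mirrors, in a simplified way, the decoupling of a node's canonical position from its latent code described in the abstract. Because every frame in the window attends to every other frame, the offsets are computed jointly rather than frame by frame.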

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* The decoupled node representation is a well-motivated design that elegantly addresses the neighborhood drift problem in large motion scenarios.
* The multi-frame temporal attention mechanism represents a significant departure from per-frame optimization strategies prevalent in prior work. This design enables the model to learn motion patterns rather than memorize instantaneous states, leading to improved temporal coherence as evidenced by both quantitative metrics and qualitative visualization

Weaknesses

* The theoretical justification for why the decoupled representation prevents neighborhood drift is primarily empirical. While Figure 2 provides visual evidence, a more rigorous analysis of the learned feature space and how it maintains semantic consistency under large deformations would strengthen the claims.
* The motion-aware loss components ($L_{diff}$, $L_{dir}$) are mentioned but never formally defined in the main paper.
* The evaluation focuses heavily on PSNR/SSIM metrics, but temporal co

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Decoupling the 4DGS makes sense.
2. The paper reports good FPS performance.

Weaknesses

1. In the quantitative comparison in Table 1, the improvement in PSNR is limited.
2. The visual comparison is weak. It would be better to provide video comparisons in the supplementary materials.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The paper clearly identifies a core weakness in existing dynamic 3D Gaussian Splatting methods, namely their reliance on per-frame optimization, which causes temporal inconsistency and overfitting to instantaneous states. The motivation for introducing a multi-frame modeling framework is well justified and directly addresses this limitation.
* The proposed multi-frame temporal deformation network interleaves MLP layers with temporal self-attention blocks and a gated fusion mechanism. This hyb

Weaknesses

* My main concern regarding the design of Mango-GS lies in the insufficient justification for using a Transformer to address temporal inconsistency. First, the main advantage of a Transformer is its ability to capture long-range dependencies, but Table 2 shows that the optimal temporal window is only six frames. Is using a Transformer to model such a short sequence truly necessary, and does the computational cost justify the potential performance gain? Furthermore, several prior works have also explore


Taxonomy

Topics: Computer Graphics and Visualization Techniques · Advanced Vision and Imaging · 3D Shape Modeling and Analysis