Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

Tianchen Deng; Zhenxiang Xiong; Nailin Wang; Fangjinhua Wang; Jiuming Liu; Jianfei Yang; Hesheng Wang

arXiv:2605.17478·cs.CV·May 19, 2026

TL;DR

Mamba-VGGT adds an external memory module with a sliding-window mechanism to geometry-grounded transformers, enabling persistent long-range reasoning and improving 3D scene reconstruction over long sequences.

Contribution

The paper proposes a Sliding Window Mamba memory module and a Zero-Init Spatial Memory Injector to address geometric drift in VGGT models, enabling scalable long-term reasoning.
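The listing does not describe how the Zero-Init Spatial Memory Injector works internally. As a minimal sketch, assuming the name refers to a residual projection whose weights start at zero (so the pretrained spatial features are untouched at initialization), the idea might look like the following. The function names, the 3-dimensional vectors, and the residual form are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a zero-initialized memory injector.
# Assumption: memory is added to each spatial token through a linear
# projection whose weights are initialized to zero.

def make_zero_projection(dim):
    """Return a dim x dim weight matrix filled with zeros."""
    return [[0.0] * dim for _ in range(dim)]

def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def inject_memory(token, memory, weight):
    """Add the projected memory vector to a spatial token (residual form)."""
    projected = matvec(weight, memory)
    return [t + p for t, p in zip(token, projected)]

token = [0.5, -1.0, 2.0]    # stand-in for one spatial feature vector
memory = [3.0, 1.0, -2.0]   # stand-in for the external memory token
weight = make_zero_projection(3)

# At initialization the projection is zero, so the token passes through unchanged.
out_at_init = inject_memory(token, memory, weight)

# After training moves a weight away from zero, memory starts to contribute.
weight[0][0] = 0.1
out_after_update = inject_memory(token, memory, weight)
```

Under this assumption, the injector behaves as an identity map on the spatial features until training moves the projection away from zero, which is one common way to attach a new module to a pretrained backbone without disturbing it at the start.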

Findings

01. Outperforms existing VGGT methods in spatial consistency.

02. Reduces trajectory accumulation errors.

03. Provides a scalable linear-complexity solution.
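To make the linear-complexity claim concrete, the sketch below counts token pairs scored by attention, comparing one global pass with fixed-size windows. It assumes non-overlapping windows and counts only attention pairs; the listing does not say whether the paper's windows overlap or how the memory update is costed, so the helper names and numbers are purely illustrative.

```python
# Hypothetical cost model: number of token pairs scored by attention.

def global_attention_pairs(num_tokens):
    """Global attention scores every token against every token."""
    return num_tokens * num_tokens

def windowed_attention_pairs(num_tokens, window):
    """Attention restricted to consecutive, non-overlapping windows."""
    total = 0
    for start in range(0, num_tokens, window):
        size = min(window, num_tokens - start)  # last window may be shorter
        total += size * size
    return total

# Doubling the sequence length quadruples the global cost
# but only doubles the windowed cost when the window size is fixed.
global_1k = global_attention_pairs(1000)
global_2k = global_attention_pairs(2000)
windowed_1k = windowed_attention_pairs(1000, 100)
windowed_2k = windowed_attention_pairs(2000, 100)
```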

Abstract

Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulated drift, primarily because the quadratic complexity of global attention necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial…
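The pattern of carrying a memory token across temporal windows can be sketched with a toy recurrence. The example below is not Mamba and not the paper's SWM module: it swaps in a scalar gated update whose gate depends on the input, as a simplified stand-in for a selective state-space update. The scalar "frame features", the gate formula, and all function names are assumptions made for illustration.

```python
import math

# Toy stand-in for a selective state update: a scalar memory with an
# input-dependent gate. Real selective state-space models use learned,
# vector-valued parameters; this only illustrates the control flow.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_memory(memory, window_frames, gate_weight=1.0):
    """Fold one window of scalar frame features into the memory."""
    for x in window_frames:
        # The gate depends on the input: larger inputs overwrite more memory.
        keep = sigmoid(-gate_weight * abs(x))
        memory = keep * memory + (1.0 - keep) * x
    return memory

def run_sequence(frames, window):
    """Process frames window by window, carrying the memory across windows."""
    memory = 0.0
    history = []
    for start in range(0, len(frames), window):
        memory = update_memory(memory, frames[start:start + window])
        history.append(memory)  # memory state after each window
    return history

frames = [0.2, -1.5, 0.7, 3.0, -0.4, 1.1, 0.0]
windowed_history = run_sequence(frames, window=3)
single_pass_history = run_sequence(frames, window=len(frames))
```

Because the state is carried over rather than reset at each window boundary, the final memory from the windowed run matches the single-pass run in this toy setting, while each step only touches one window of frames at a time.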
