Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

Zichen Zou; Xiaosong Jia; Zuxuan Wu; Yu-Gang Jiang

arXiv:2605.09644·cs.CV·May 12, 2026

TL;DR

RetrieveVGGT introduces a training-free, retrieval-based method for long-context 3D reconstruction that maintains constant memory and outperforms previous methods.

Contribution

It formulates context construction as a retrieval problem, using similarity between current queries and cached keys to enable scalable, memory-efficient 3D reconstruction.

Findings

1. RetrieveVGGT outperforms state-of-the-art methods in 3D reconstruction.
2. It maintains constant memory usage regardless of sequence length.
3. The method achieves superior reconstruction quality in experiments.

Abstract

Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via a scalable Transformer architecture, but the quadratic complexity of global attention prevents long-context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with the number of frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework that formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget that is close to its training context length. Interestingly, we find that the similarity between current-frame queries and cached history-frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity…
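
The retrieval idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `retrieve_frames`, the tensor shapes, the mean aggregation over token pairs, and the plain top-k selection are all assumptions made for illustration. The abstract does not say how token-level similarities are pooled into a per-frame score, and the diversity step it mentions is not covered here.

```python
import numpy as np

def retrieve_frames(query, cached_keys, k):
    """Pick the k cached frames whose keys best match the current frame's queries.

    query:       (T, D) query tokens of the current frame (hypothetical shape)
    cached_keys: (F, T, D) key tokens of F cached history frames (hypothetical shape)
    Returns (indices of the selected frames, per-frame scores).
    """
    d = query.shape[-1]
    # Scaled dot-product logits between each query token and each cached key token.
    # Resulting shape: (F, T_query, T_key).
    logits = np.einsum("qd,fkd->fqk", query, cached_keys) / np.sqrt(d)
    # Assumption: pool to one relevance score per frame by averaging all token pairs.
    scores = logits.mean(axis=(1, 2))
    # A fixed k keeps the retrieved context the same size however long the stream gets.
    k = min(k, scores.shape[0])
    top = np.argsort(-scores, kind="stable")[:k]
    return top, scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((8, 16))        # current-frame queries
    keys = rng.standard_normal((100, 8, 16))  # keys of 100 cached frames
    selected, _ = retrieve_frames(q, keys, k=4)
    print(selected)
```

Because `k` is fixed, the number of frames passed on to attention does not grow with the stream length, which is the property the abstract relies on for a controllable memory budget.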


Code & Models

Repositories

zzctmd/RetrieveVGGT (GitHub)
