FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

Zipeng Wang; Dan Xu

arXiv:2512.01540 · cs.CV · March 26, 2026

TL;DR

FlashVGGT introduces a descriptor-based attention mechanism for 3D reconstruction that maintains high accuracy while significantly improving scalability and inference efficiency over long image sequences.

Contribution

The paper proposes a compressed descriptor attention method that reduces computational complexity and enables efficient online inference for large-scale 3D reconstruction.

Findings

01 Achieves accuracy comparable to VGGT in 3D reconstruction.

02 Reduces inference time to 9.3% of VGGT's for 1,000 images.

03 Scales efficiently to sequences of more than 3,000 images.

Abstract

3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as…
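
To make the scalability argument in the abstract concrete, the following is a rough sketch that counts query-key pairs for dense global attention versus attention restricted to per-frame descriptors. It rests on assumptions not confirmed by the truncated abstract: that every image token attends to the descriptors of all frames, and that each frame contributes a fixed number of descriptors. The function names and the values 1,024 tokens and 64 descriptors per frame are hypothetical and chosen only for illustration. The pair count is a proxy for attention cost, not a measured runtime, and it is unrelated to the 9.3% figure reported above.

```python
def dense_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token (quadratic)."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def descriptor_attention_pairs(
    num_frames: int, tokens_per_frame: int, descriptors_per_frame: int
) -> int:
    """Query-key pairs when every token attends only to descriptor tokens.

    Assumption: keys are the descriptors of all frames; queries are all tokens.
    """
    num_queries = num_frames * tokens_per_frame
    num_keys = num_frames * descriptors_per_frame
    return num_queries * num_keys


def pair_ratio(
    num_frames: int, tokens_per_frame: int, descriptors_per_frame: int
) -> float:
    """Descriptor-attention pairs as a fraction of dense-attention pairs."""
    dense = dense_attention_pairs(num_frames, tokens_per_frame)
    compressed = descriptor_attention_pairs(
        num_frames, tokens_per_frame, descriptors_per_frame
    )
    return compressed / dense


if __name__ == "__main__":
    # Hypothetical sizes: 1,024 tokens and 64 descriptors per frame.
    for frames in (100, 1000, 3000):
        print(frames, pair_ratio(frames, 1024, 64))
```

Under these assumptions the ratio reduces to descriptors_per_frame / tokens_per_frame, so the relative saving stays the same at every sequence length. Both variants still grow quadratically with the number of frames in this count; the compression lowers the constant factor by shrinking the key set.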

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization