FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
Zipeng Wang, Dan Xu

TL;DR
FlashVGGT introduces a descriptor-based attention mechanism for 3D reconstruction that maintains high accuracy while significantly improving scalability and inference efficiency over long image sequences.
Contribution
The paper proposes a compressed descriptor attention method that reduces computational complexity and enables efficient online inference in large-scale 3D reconstruction tasks.
Findings
Achieves accuracy comparable to VGGT in 3D reconstruction.
Reduces inference time to 9.3% of VGGT's for 1,000 images.
Scales efficiently to sequences over 3,000 images.
Abstract
3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as…
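To make the scalability argument concrete, here is a minimal NumPy sketch of the general idea of attending to per-frame descriptors instead of to all tokens. It is not the authors' implementation. The abstract is cut off before it states how global attention is computed, so the sketch assumes that every image token acts as a query over the descriptor tokens. It also assumes plain average pooling as the compression step and omits learned projections. The names `compress_frame_tokens`, `descriptor_attention`, `score_matrix_sizes`, and the `pool` parameter are hypothetical.

```python
import numpy as np

def compress_frame_tokens(tokens, pool):
    """Average-pool each frame's tokens in groups of `pool` (assumed compression).

    tokens: (frames, tokens_per_frame, dim)
    returns: (frames, tokens_per_frame // pool, dim)
    """
    f, n, d = tokens.shape
    assert n % pool == 0, "tokens_per_frame must be divisible by pool"
    return tokens.reshape(f, n // pool, pool, d).mean(axis=2)

def descriptor_attention(tokens, descriptors):
    """Assumed form: all image tokens (queries) attend to descriptors (keys/values)."""
    f, n, d = tokens.shape
    q = tokens.reshape(f * n, d)        # every token from every frame
    kv = descriptors.reshape(-1, d)     # descriptors from every frame
    scores = q @ kv.T / np.sqrt(d)      # (f * n, number of descriptors)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ kv).reshape(f, n, d)

def score_matrix_sizes(frames, tokens_per_frame, pool):
    """Entries in the attention score matrix: full self-attention vs. descriptors."""
    total = frames * tokens_per_frame
    n_desc = frames * (tokens_per_frame // pool)
    return total * total, total * n_desc

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16, 8))   # 4 frames, 16 tokens each, dim 8
descriptors = compress_frame_tokens(tokens, pool=4)
out = descriptor_attention(tokens, descriptors)
full_size, compressed_size = score_matrix_sizes(4, 16, pool=4)
```

Under these assumptions the score matrix shrinks by exactly the pooling factor, because the number of keys drops from all tokens to all descriptors while the number of queries stays the same. The compression scheme and ratio used in the paper may differ.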
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
