TL;DR
TurboVGGT introduces an adaptive attention-based transformer for rapid multi-view 3D reconstruction, balancing speed and quality through learned sparse global attention and local frame aggregation.
Contribution
The paper proposes an adaptive alternating attention mechanism that learns which tokens are representative, so that global geometry is modeled through sparse attention over those tokens and local geometry through attention within each frame.
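To make the idea concrete, here is a minimal NumPy sketch of one alternating block: frame-wise attention first, then global attention in which every token attends only to a small set of selected "representative" tokens. It is an illustration under stated assumptions, not the TurboVGGT implementation. The function names, the linear scoring vector `score_w`, and the hard top-k selection are all hypothetical stand-ins. The summary does not say how the paper scores or selects tokens, or how it varies sparsity across layers and frames.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; q: (..., Nq, D), k and v: (..., Nk, D).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def frame_attention(tokens):
    # tokens: (F, N, D). Each frame attends only to its own N tokens.
    return attention(tokens, tokens, tokens)

def sparse_global_attention(tokens, score_w, keep):
    # Assumption: a linear score ranks tokens and the top `keep` act as
    # keys/values for all F*N queries. Cost scales with F*N*keep, not (F*N)^2.
    F, N, D = tokens.shape
    flat = tokens.reshape(F * N, D)
    saliency = flat @ score_w               # (F*N,) hypothetical scores
    top = np.argsort(saliency)[-keep:]      # indices of representative tokens
    reps = flat[top]                        # (keep, D)
    out = attention(flat, reps, reps)       # (F*N, D)
    return out.reshape(F, N, D), top

def alternating_block(tokens, score_w, keep):
    # Local (per-frame) attention, then sparse global attention, with residuals.
    x = tokens + frame_attention(tokens)
    g, top = sparse_global_attention(x, score_w, keep)
    return x + g, top

rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 4, 8))     # 3 frames, 4 tokens each, dim 8
score_w = rng.standard_normal(8)            # random stand-in for a learned scorer
out, top = alternating_block(tokens, score_w, keep=5)
print(out.shape, len(top))
```

In this sketch, setting `keep` to the total token count `F * N` recovers dense global attention, which makes the speed and quality trade-off explicit: a smaller `keep` shrinks the global attention matrix from `(F*N) x (F*N)` to `(F*N) x keep`. Hard top-k selection is not differentiable, so a trainable version would need some relaxation. The query, key, and value projections are omitted for brevity.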
Findings
TurboVGGT reconstructs scenes faster than existing methods while keeping reconstruction quality competitive on multiple benchmarks.
The results indicate that adaptive sparse attention is effective for multi-view 3D reconstruction.
Abstract
Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to achieve a balance between reconstruction quality and computational efficiency, which limits their scalability and efficiency. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens to capture global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT employs an…
