HTTM: Head-wise Temporal Token Merging for Faster VGGT
Weitian Wang, Lukas Meiner, Rai Shubham, Cecilia De La Parra, Akash Kumar

TL;DR
This paper introduces HTTM, a training-free token merging method for VGGT that merges tokens per attention head, accelerating 3D scene reconstruction by up to 7x with negligible performance loss.
Contribution
HTTM is a training-free, head-wise token merging technique that preserves feature diversity and improves merging efficiency in VGGT models.
Findings
Achieves up to 7x acceleration in inference.
Maintains high reconstruction quality with negligible performance drops.
Leverages spatial locality and temporal correspondence for efficient merging.
Abstract
The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the…
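To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of uniform versus head-wise merging. It is not the authors' implementation: the matching step borrows a simple ToMe-style bipartite scheme (even-indexed tokens merge into their most similar odd-indexed token) as a stand-in, and the temporal correspondence that HTTM exploits is not modelled. The function names `match`, `merge`, `uniform_merge`, and `headwise_merge`, the `(H, N, D)` layout, and the merge count `r` are all assumptions made for illustration.

```python
import numpy as np

def match(feat, r):
    """Pick r source tokens (even positions) and, for each, its most similar
    destination token (odd positions) by cosine similarity.
    Assumes no token is all-zero and r <= number of even positions."""
    f = feat / np.linalg.norm(feat, axis=-1, keepdims=True)
    a_idx = np.arange(0, feat.shape[0], 2)   # candidate sources
    b_idx = np.arange(1, feat.shape[0], 2)   # candidate destinations
    sim = f[a_idx] @ f[b_idx].T              # (|A|, |B|) cosine similarities
    best_b = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)
    top = np.argsort(-best_sim)[:r]          # the r best-matched sources
    return a_idx[top], b_idx[best_b[top]]

def merge(x, src, dst):
    """Average each source token into its destination, then drop the sources."""
    out = x.copy()
    counts = np.ones(x.shape[0])
    np.add.at(out, dst, x[src])              # unbuffered add handles repeated dst
    np.add.at(counts, dst, 1.0)
    out = out / counts[:, None]
    keep = np.setdiff1d(np.arange(x.shape[0]), src)
    return out[keep]

def uniform_merge(x, r):
    """x: (H, N, D). One matching, computed on head-concatenated features,
    is reused by every head."""
    H, N, D = x.shape
    src, dst = match(x.transpose(1, 0, 2).reshape(N, H * D), r)
    return np.stack([merge(x[h], src, dst) for h in range(H)])

def headwise_merge(x, r):
    """x: (H, N, D). Each head computes and applies its own matching."""
    return np.stack([merge(x[h], *match(x[h], r)) for h in range(x.shape[0])])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 8))          # 4 heads, 16 tokens, 8 dims per head
print(uniform_merge(x, 4).shape, headwise_merge(x, 4).shape)
```

Both variants shrink the sequence from N to N - r tokens per head, so the cost of all-to-all attention drops equally. The difference is that the head-wise variant lets each head choose which tokens to merge from its own features, rather than forcing one shared choice onto every head.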
Taxonomy
Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Computer Graphics and Visualization Techniques
