HTTM: Head-wise Temporal Token Merging for Faster VGGT

Weitian Wang; Lukas Meiner; Rai Shubham; Cecilia De La Parra; Akash Kumar

arXiv:2511.21317 · cs.CV · November 27, 2025

TL;DR

This paper introduces HTTM, a training-free token merging method for VGGT that merges tokens separately within each attention head. It accelerates 3D scene reconstruction by up to 7x with negligible loss in reconstruction quality.

Contribution

HTTM is a training-free, head-wise token merging technique that preserves feature diversity and improves merging efficiency in VGGT models.

Findings

01. Achieves up to 7x acceleration in inference.

02. Maintains high reconstruction quality with negligible performance drops.

03. Leverages spatial locality and temporal correspondence for efficient merging.

Abstract

The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the…
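To make the contrast between uniform and head-wise merging concrete, the following is a minimal NumPy sketch. It is not the HTTM algorithm: the abstract is truncated here and does not specify the matching procedure, so a generic ToMe-style bipartite matching on cosine similarity is swapped in as a stand-in. The function names (bipartite_plan, apply_plan, merge_uniform, merge_headwise), the alternating source/destination split, and the averaging rule are all assumptions made for illustration. The sketch also omits the temporal and spatial structure the paper exploits, and the unmerging step a real attention layer would need.

```python
import numpy as np


def bipartite_plan(feat, r):
    """Plan the merge of r tokens from an (N, D) feature matrix.

    Hypothetical helper: even-indexed tokens are sources, odd-indexed
    tokens are destinations, and the r sources most similar to their
    best destination are merged into it.
    """
    normed = feat / (np.linalg.norm(feat, axis=-1, keepdims=True) + 1e-8)
    src = np.arange(0, len(feat), 2)
    dst = np.arange(1, len(feat), 2)
    sim = normed[src] @ normed[dst].T        # cosine similarity, (n_src, n_dst)
    best_dst = sim.argmax(axis=1)            # best destination per source
    order = np.argsort(-sim.max(axis=1))     # most similar sources first
    merged_src = src[order[:r]]
    kept_src = src[order[r:]]
    target = dst[best_dst[order[:r]]]
    return kept_src, merged_src, target, dst


def apply_plan(x, plan):
    """Average merged sources into their destinations; returns (N - r, D)."""
    kept_src, merged_src, target, dst = plan
    sums = x.copy()
    counts = np.ones(len(x))
    np.add.at(sums, target, x[merged_src])   # handles repeated targets
    np.add.at(counts, target, 1.0)
    out_idx = np.sort(np.concatenate([kept_src, dst]))
    return sums[out_idx] / counts[out_idx, None]


def merge_uniform(x, r):
    """x: (H, N, D). One plan from head-concatenated features, shared by all heads."""
    H, N, D = x.shape
    plan = bipartite_plan(x.transpose(1, 0, 2).reshape(N, H * D), r)
    return np.stack([apply_plan(x[h], plan) for h in range(H)])


def merge_headwise(x, r):
    """x: (H, N, D). Each head computes and applies its own plan."""
    return np.stack([apply_plan(x[h], bipartite_plan(x[h], r))
                     for h in range(x.shape[0])])


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 8))          # 4 heads, 16 tokens, 8 dims per head
print(merge_uniform(x, r=4).shape, merge_headwise(x, r=4).shape)
```

Both variants reduce the token count by the same amount, so the attention cost drops equally. The difference is that merge_uniform collapses the same token positions in every head, while merge_headwise lets each head choose which of its own tokens are redundant, which is the property the abstract credits with preserving representational ability.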

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Computer Graphics and Visualization Techniques