PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers
Haotang Li,Zhenyu Qi,Shaohan Henry Wang,Kebin Peng,Zi Wang,Qing Guo,Sen He,Huanrui Yang

TL;DR
PaceVGGT prunes DINO patch tokens before the first Alternating-Attention (AA) block of a frozen VGGT, cutting inference latency while maintaining reconstruction quality on 3D tasks.
Contribution
PaceVGGT is a pre-AA token pruning framework that combines a lightweight Token Scorer with feature-guided restoration, enabling faster VGGT inference while maintaining reconstruction quality.
Findings
Reduces ScanNet-50 latency by 5.1x at N=300 tokens.
Achieves 1.47x latency reduction over LiteVGGT at N=1000.
Maintains reconstruction quality while accelerating inference.
Abstract
Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a…
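The per-frame keep budget described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `keep_top_k_per_frame` and `attention_pair_ratio`, the toy scores, and the plain top-k rule are all assumptions made for illustration. The learned Token Scorer, the importance-adaptive merge/prune assignment, and the restoration step are omitted. The sketch only shows how a fixed budget of k tokens per frame bounds the sequence length seen by the backbone, and how the number of token pairs in a quadratic attention layer shrinks as a result.

```python
from typing import List, Sequence


def keep_top_k_per_frame(scores: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Return, for each frame, the indices of its k highest-scoring tokens.

    Hypothetical helper: `scores[f][i]` stands in for the importance that a
    scorer would assign to token i of frame f. Indices come back in ascending
    order so the kept tokens preserve their original patch-grid ordering.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    kept = []
    for frame_scores in scores:
        k_frame = min(k, len(frame_scores))
        # Rank by descending score; ties are broken by the lower index.
        ranked = sorted(range(len(frame_scores)), key=lambda i: (-frame_scores[i], i))
        kept.append(sorted(ranked[:k_frame]))
    return kept


def attention_pair_ratio(tokens_per_frame: int, k: int, num_frames: int) -> float:
    """Ratio of token pairs in global attention before vs. after pruning.

    Assumes attention cost grows with the square of the total token count.
    """
    full = (num_frames * tokens_per_frame) ** 2
    pruned = (num_frames * k) ** 2
    return full / pruned


# Toy input: 2 frames with 4 tokens each (made-up scores).
scores = [
    [0.9, 0.1, 0.4, 0.7],
    [0.2, 0.8, 0.6, 0.3],
]
kept = keep_top_k_per_frame(scores, k=2)
print(kept)                            # [[0, 3], [1, 2]]
print(attention_pair_ratio(4, 2, 2))   # 4.0
```

Halving the tokens kept per frame cuts the number of attended token pairs by a factor of four in this toy setting, which is why pruning before the attention stack can matter more than reducing tokens inside it. Measured latency gains depend on the rest of the pipeline and will not match this ratio exactly.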
