PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

Haotang Li, Zhenyu Qi, Shaohan Henry Wang, Kebin Peng, Zi Wang, Qing Guo, Sen He, Huanrui Yang

arXiv:2605.08371 · cs.CV · May 12, 2026

TL;DR

PaceVGGT prunes DINO patch tokens before the first Alternating-Attention (AA) block of a frozen VGGT, cutting inference latency (5.1x on ScanNet-50 at N=300) while maintaining reconstruction quality on 3D tasks.

Contribution

The paper proposes a pre-AA token pruning framework that pairs a lightweight Token Scorer with feature-guided restoration, enabling faster inference on a frozen VGGT while maintaining reconstruction quality.

Findings

01. Reduces ScanNet-50 latency by 5.1x at N=300 tokens.

02. Achieves 1.47x latency reduction over LiteVGGT at N=1000.

03. Maintains reconstruction quality while accelerating inference.

Abstract

Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a…
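
To make the per-frame keep budget concrete, here is a minimal sketch, not the authors' implementation. It assumes the importance scores are already available (in the paper they come from a learned Token Scorer over DINO features) and keeps the top-k tokens in each frame. The function names `prune_tokens_per_frame` and `attention_cost_ratio` are hypothetical, and the sketch ignores the merge/prune assignment, the restoration step, and any special (camera or register) tokens.

```python
def prune_tokens_per_frame(scores, keep_per_frame):
    """Select which patch tokens each frame keeps.

    scores: list of frames, each a list of per-token importance scores
            (assumed given; a learned scorer would produce them).
    keep_per_frame: fixed keep budget applied to every frame.
    Returns one sorted list of kept token indices per frame.
    """
    kept = []
    for frame_scores in scores:
        k = min(keep_per_frame, len(frame_scores))
        # Rank token indices by descending score; Python's sort is stable,
        # so ties are broken in favor of the lower index.
        ranked = sorted(range(len(frame_scores)), key=lambda i: -frame_scores[i])
        # Re-sort the survivors so they stay in their original spatial order.
        kept.append(sorted(ranked[:k]))
    return kept


def attention_cost_ratio(tokens_per_frame, keep_per_frame):
    """Pairwise-attention work after pruning relative to before.

    Attention cost is quadratic in sequence length, so keeping K of P
    tokens per frame scales the cost by (K / P) ** 2. The frame count
    cancels for both frame-wise and global attention.
    """
    return (keep_per_frame / tokens_per_frame) ** 2


if __name__ == "__main__":
    toy_scores = [
        [0.1, 0.9, 0.4, 0.7],  # frame 0
        [0.8, 0.2, 0.6, 0.3],  # frame 1
    ]
    print(prune_tokens_per_frame(toy_scores, keep_per_frame=2))
    print(attention_cost_ratio(tokens_per_frame=4, keep_per_frame=2))
```

Because every frame keeps the same number of tokens, the sequence length the backbone sees is fixed in advance. The quadratic ratio covers only the attention term, so it should not be read as a prediction of the end-to-end speedups reported above.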
