LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

Zhijian Shu; Cheng Lin; Tao Xie; Wei Yin; Ben Li; Zhiyuan Pu; Weize Li; Yao Yao; Xun Cao; Xiaoyang Guo; Xiao-Xiao Long

arXiv:2512.04939·cs.CV·December 5, 2025

TL;DR

LiteVGGT significantly accelerates and reduces memory usage of VGGT models for 3D reconstruction by leveraging geometry-aware token merging and caching strategies, enabling large-scale scene processing.

Contribution

It introduces a novel geometry-aware cached token merging method that enhances VGGT efficiency without sacrificing core performance.

Findings

01 Achieves up to 10x speedup and substantial memory reduction.

02 Enables efficient processing of 1000-image scenes.

03 Maintains core 3D reconstruction accuracy.

Abstract

3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive on long sequences, which limits its application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, which achieves up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing merge decisions to be reused. Guided by these insights, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better…
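The two insights in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract is cut off before it describes how anchors are chosen, so the `importance` score, the top-k anchor rule, the function names, and the constant update standing in for attention are all hypothetical placeholders. The sketch shows only the general pattern: merge similar tokens into anchors, compute the assignment once, and reuse the cached assignment across several layers.

```python
import numpy as np

def build_merge_plan(tokens, importance, num_anchors):
    """Assign every token to its most similar anchor (hypothetical anchor rule)."""
    # Assumption: anchors are simply the tokens with the highest importance score.
    anchor_idx = np.sort(np.argsort(-importance, kind="stable")[:num_anchors])
    # Cosine similarity between every token and every anchor.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit[anchor_idx].T               # shape (N, num_anchors)
    assign = sim.argmax(axis=1)                   # anchor slot per token
    assign[anchor_idx] = np.arange(num_anchors)   # each anchor keeps its own slot
    return assign

def merge(tokens, assign, num_anchors):
    """Average all tokens that share an anchor slot."""
    merged = np.zeros((num_anchors, tokens.shape[1]), dtype=float)
    np.add.at(merged, assign, tokens)             # unbuffered scatter-add
    counts = np.bincount(assign, minlength=num_anchors).reshape(-1, 1)
    return merged / counts

def unmerge(merged, assign):
    """Copy each anchor's value back to every token it absorbed."""
    return merged[assign]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))                 # 16 toy tokens, 8 channels
importance = rng.random(16)                       # placeholder importance score

# The merge plan is computed once and cached ...
plan = build_merge_plan(tokens, importance, num_anchors=4)

# ... then reused for several consecutive layers instead of being recomputed.
for layer in range(3):
    merged = merge(tokens, plan, 4)               # 16 tokens -> 4 tokens
    merged = merged + 0.1                         # stand-in for an attention block
    tokens = unmerge(merged, plan)                # 4 tokens -> 16 tokens
```

Because the stand-in layer runs on 4 tokens instead of 16, any operation whose cost grows with token count becomes cheaper, and reusing `plan` avoids recomputing the similarity matrix at every layer.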

Code & Models

Models

🤗 ZhijianShu/LiteVGGT (model)

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Advanced Neural Network Applications