Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Yutian Chen; Yuheng Qiu; Ruogu Li; Ali Agha; Shayegan Omidshafiei; Jay Patrikar; Sebastian Scherer

arXiv:2511.14751·cs.CV·May 15, 2026

TL;DR

Co-Me introduces a confidence-guided token merging method that accelerates visual geometric transformers by selectively merging uncertain tokens, achieving significant speedups without retraining.

Contribution

It presents a novel confidence-based token merging technique that enhances transformer efficiency while preserving performance across multiple visual geometric tasks.

Findings

01

Up to 21.5x speedup on VGGT

02

Up to 20.4x speedup on Pi3

03

Maintains performance without retraining or finetuning

Abstract

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers that requires no retraining or finetuning of the base model. Co-Me distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and Pi3, Co-Me achieves up to 21.5x and 20.4x speedup, respectively, making visual geometric transformers practical for real-time 3D perception and reconstruction.
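
To make the idea of ranking tokens by confidence and merging the low-confidence ones concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function name `merge_low_confidence`, the `merge_ratio` and `group_size` parameters, and the choice to merge by averaging consecutive low-confidence tokens are all assumptions made for illustration, and the confidence scores are supplied by hand rather than produced by a distilled predictor.

```python
from typing import List, Sequence, Tuple


def merge_low_confidence(
    tokens: Sequence[Sequence[float]],
    confidence: Sequence[float],
    merge_ratio: float = 0.5,   # hypothetical: fraction of tokens eligible for merging
    group_size: int = 2,        # hypothetical: how many low-confidence tokens form one merged token
) -> Tuple[List[List[float]], List[List[int]]]:
    """Keep high-confidence tokens as-is and average low-confidence ones in groups.

    Returns the shortened token list plus, for each output token, the indices
    of the input tokens it represents (useful for restoring the original
    sequence length afterwards).
    """
    n = len(tokens)
    if n != len(confidence):
        raise ValueError("tokens and confidence must have the same length")
    if n == 0:
        return [], []

    num_merge = int(n * merge_ratio)
    # Rank tokens by ascending confidence; the first num_merge are merge candidates.
    ranked = sorted(range(n), key=lambda i: confidence[i])
    # Restore positional order so each group contains tokens that are close in the sequence.
    low = sorted(ranked[:num_merge])
    low_set = set(low)

    merged: List[List[float]] = []
    groups: List[List[int]] = []

    # High-confidence tokens pass through unchanged.
    for i in range(n):
        if i not in low_set:
            merged.append(list(tokens[i]))
            groups.append([i])

    # Low-confidence tokens are averaged group by group (assumed merge rule).
    dim = len(tokens[0])
    for start in range(0, len(low), group_size):
        group = low[start:start + group_size]
        avg = [sum(tokens[i][d] for i in group) / len(group) for d in range(dim)]
        merged.append(avg)
        groups.append(group)

    return merged, groups


tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
confidence = [0.9, 0.8, 0.1, 0.2]
merged, groups = merge_low_confidence(tokens, confidence)
print(merged)  # [[1.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
print(groups)  # [[0], [1], [2, 3]]
```

In this toy run, the two low-confidence tokens collapse into one averaged token, so the sequence shrinks from four tokens to three while every input position is still represented by some output token. Because self-attention cost grows roughly quadratically with the number of tokens, a shorter sequence saves proportionally more on longer inputs, which is in line with the abstract's remark that speedups scale with sequence length.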
