ToSA: Token Merging with Spatial Awareness

Hsiang-Wei Huang; Wenhao Chai; Kuang-Ming Chen; Cheng-Yen Yang; Jenq-Neng Hwang

arXiv:2506.20066·cs.CV·June 26, 2025

TL;DR

ToSA introduces a token merging method for Vision Transformers that integrates spatial information via depth images, leading to better scene structure preservation and improved efficiency in visual tasks.

Contribution

ToSA is a novel token merging approach that combines semantic and spatial cues, using depth images to supply the spatial information, with the aim of accelerating ViTs while better preserving scene structure.

Findings

01. Outperforms previous token merging methods on multiple benchmarks.

02. Reduces runtime of Vision Transformers significantly.

03. Improves scene structure preservation during token merging.

Abstract

Token merging has emerged as an effective strategy for accelerating Vision Transformers (ViTs) by reducing computational cost. However, existing methods rely primarily on the feature similarity of visual tokens and overlook spatial information, which can serve as a reliable merging criterion in the early layers of a ViT, where visual tokens carry only weak visual information. In this paper, we propose ToSA, a novel token merging method that combines semantic and spatial awareness to guide the merging process. ToSA takes the depth image as input to generate pseudo spatial tokens, which provide auxiliary spatial information for visual token merging. With this spatial awareness, ToSA achieves a more informed merging strategy that better preserves critical scene structure. Experimental results…
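To make the idea in the abstract concrete, here is a minimal sketch of merging guided by both semantic and spatial similarity. It is not the authors' implementation. The way pseudo spatial tokens are built (patch row, patch column, mean patch depth), the blending weight `alpha`, the exponential distance kernel, and the even/odd bipartite matching (in the style of earlier token merging work) are all assumptions chosen for illustration, as are the helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def pseudo_spatial_tokens(depth, patch):
    """Assumed construction: one [row, col, mean depth] vector per depth-map patch."""
    rows, cols = len(depth) // patch, len(depth[0]) // patch
    tokens = []
    for r in range(rows):
        for c in range(cols):
            vals = [depth[r * patch + i][c * patch + j]
                    for i in range(patch) for j in range(patch)]
            tokens.append([r / max(rows - 1, 1), c / max(cols - 1, 1),
                           sum(vals) / len(vals)])
    return tokens

def combined_score(v_i, v_j, s_i, s_j, alpha):
    """Assumed blend: alpha * semantic similarity + (1 - alpha) * spatial similarity."""
    spatial_sim = math.exp(-math.dist(s_i, s_j))  # 1.0 when spatial tokens coincide
    return alpha * cosine(v_i, v_j) + (1.0 - alpha) * spatial_sim

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def merge_tokens(visual, spatial, r, alpha=0.5):
    """Merge the r best-matched even-indexed tokens into odd-indexed partners."""
    a_idx = list(range(0, len(visual), 2))
    b_idx = list(range(1, len(visual), 2))
    if not b_idx:  # nothing to merge into
        return [list(v) for v in visual], [list(s) for s in spatial]
    matches = []
    for i in a_idx:
        # Best partner in set B for token i under the combined score.
        best_score, best_j = max(
            (combined_score(visual[i], visual[j], spatial[i], spatial[j], alpha), j)
            for j in b_idx)
        matches.append((best_score, i, best_j))
    matches.sort(reverse=True)
    groups = {j: [j] for j in b_idx}
    merged_away = set()
    for _, i, j in matches[:r]:
        groups[j].append(i)
        merged_away.add(i)
    out_visual, out_spatial = [], []
    for idx in range(len(visual)):
        if idx in merged_away:
            continue
        members = groups.get(idx, [idx])
        out_visual.append(mean_vector([visual[m] for m in members]))
        out_spatial.append(mean_vector([spatial[m] for m in members]))
    return out_visual, out_spatial

# Toy scene: top half near (depth 1.0), bottom half far (depth 5.0).
depth = [[1.0] * 4, [1.0] * 4, [5.0] * 4, [5.0] * 4]
spatial = pseudo_spatial_tokens(depth, patch=2)
visual = [[1.0, 0.0] for _ in spatial]  # identical, uninformative early-layer features
merged_visual, merged_spatial = merge_tokens(visual, spatial, r=2, alpha=0.5)
```

In this toy scene the visual features are identical, so feature similarity alone cannot tell the near patches from the far ones. With `alpha=0.5` the spatial term breaks the tie and each merged token stays within a single depth layer, whereas `alpha=1.0` (semantic similarity only) merges tokens across the depth boundary.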

Taxonomy

Topics: Auction Theory and Applications · Sharing Economy and Platforms · Optimization and Search Problems