Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers
Dong Hoon Lee, Seunghoon Hong

TL;DR
This paper introduces Decoupled Token Embedding for Merging (DTEM), a method that improves token merging in Vision Transformers by learning embeddings dedicated to merging through a continuously relaxed, differentiable merging process, improving efficiency across multiple tasks.
Contribution
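The paper proposes a decoupled embedding approach for token merging in ViTs: a lightweight embedding module, separate from the ViT forward pass, extracts features dedicated to merging instead of reusing intermediate ViT features, which enables modular training of the merging policy.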
Findings
Achieves 37.2% FLOPs reduction on ImageNet-1k classification.
Maintains 79.85% top-1 accuracy with DeiT-small.
Demonstrates consistent improvements in classification, captioning, and segmentation tasks.
Abstract
Recent token reduction methods for Vision Transformers (ViTs) incorporate token merging, which measures the similarities between token embeddings and combines the most similar pairs. However, their merging policies are directly dependent on intermediate features in ViTs, which prevents exploiting features tailored for merging and requires end-to-end training to improve token merging. In this paper, we propose Decoupled Token Embedding for Merging (DTEM) that enhances token merging through a decoupled embedding learned via a continuously relaxed token merging process. Our method introduces a lightweight embedding module decoupled from the ViT forward pass to extract dedicated features for token merging, thereby addressing the restriction from using intermediate features. The continuously relaxed token merging, applied during training, enables us to learn the decoupled embeddings in a…
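The core idea in the abstract (measure similarity on a separate embedding, then combine the most similar tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear projection `embed`, the greedy pair selection, and the names `merge_tokens`, `weight`, and `r` are all assumptions made for illustration, and the continuously relaxed merging used for training is not shown.

```python
import math


def embed(token, weight):
    """Hypothetical decoupled embedding: a linear projection used only for matching."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def merge_tokens(tokens, weight, r):
    """Greedily merge the most similar pair of tokens, up to r times.

    Similarity is measured on the decoupled embeddings, while the merge
    averages the original features, weighted by how many source tokens
    each merged token already represents.
    (Greedy pairing is a simplification chosen for this sketch.)
    """
    tokens = [list(t) for t in tokens]
    sizes = [1] * len(tokens)
    for _ in range(min(r, len(tokens) - 1)):
        keys = [embed(t, weight) for t in tokens]
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                s = cosine(keys[i], keys[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        total = sizes[i] + sizes[j]
        tokens[i] = [
            (sizes[i] * a + sizes[j] * b) / total
            for a, b in zip(tokens[i], tokens[j])
        ]
        sizes[i] = total
        del tokens[j]
        del sizes[j]
    return tokens, sizes


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # matching on the raw features
merged, sizes = merge_tokens(tokens, identity, r=1)
```

With the identity projection, the first two tokens are the closest pair and are averaged into one. Swapping in a different `weight` changes which pairs are matched without touching the token features themselves, which is the sense in which the matching embedding is decoupled in this sketch.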
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
