Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers

Dong Hoon Lee; Seunghoon Hong

arXiv:2412.10569·cs.CV·December 17, 2024



TL;DR

This paper introduces Decoupled Token Embedding for Merging (DTEM), a novel method that improves token merging in Vision Transformers by learning dedicated embeddings through a differentiable process, enhancing efficiency across multiple tasks.

Contribution

The paper proposes a decoupled embedding approach for token merging in ViTs, enabling modular training and better feature extraction for efficient token reduction.

Findings

1. Achieves 37.2% FLOPs reduction on ImageNet-1k classification.

2. Maintains 79.85% top-1 accuracy with DeiT-small.

3. Demonstrates consistent improvements in classification, captioning, and segmentation tasks.

Abstract

Recent token reduction methods for Vision Transformers (ViTs) incorporate token merging, which measures the similarities between token embeddings and combines the most similar pairs. However, their merging policies are directly dependent on intermediate features in ViTs, which prevents exploiting features tailored for merging and requires end-to-end training to improve token merging. In this paper, we propose Decoupled Token Embedding for Merging (DTEM) that enhances token merging through a decoupled embedding learned via a continuously relaxed token merging process. Our method introduces a lightweight embedding module decoupled from the ViT forward pass to extract dedicated features for token merging, thereby addressing the restriction from using intermediate features. The continuously relaxed token merging, applied during training, enables us to learn the decoupled embeddings in a…
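
The merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single linear projection standing in for the "lightweight embedding module", the greedy pair-by-pair matching, and the softmax-based relaxation are all assumptions made for illustration. The point it shows is the decoupling itself: similarity is measured on a separate embedding, while the merge averages the original token features.

```python
import math


def project(token, weight):
    """Hypothetical decoupled embedding: a small linear map used only for merging."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def cosine(a, b):
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def merge_most_similar(tokens, sizes, weight, r):
    """Greedily merge the r most similar token pairs, one pair per step.

    Similarity is computed on the decoupled embeddings; the merge itself
    averages the original features, weighted by how many input tokens each
    token already represents (its size).
    """
    tokens = [list(t) for t in tokens]
    sizes = list(sizes)
    for _ in range(r):
        if len(tokens) < 2:
            break
        emb = [project(t, weight) for t in tokens]
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                s = cosine(emb[i], emb[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        total = sizes[i] + sizes[j]
        tokens[i] = [
            (sizes[i] * a + sizes[j] * b) / total
            for a, b in zip(tokens[i], tokens[j])
        ]
        sizes[i] = total
        del tokens[j]
        del sizes[j]
    return tokens, sizes


def soft_merge_weights(similarities, temperature=0.1):
    """Assumed relaxation: a softmax over similarities replaces the hard choice
    of a single merge partner, so the weights vary smoothly with the embedding."""
    peak = max(similarities)
    exps = [math.exp((s - peak) / temperature) for s in similarities]
    norm = sum(exps)
    return [e / norm for e in exps]
```

Calling `merge_most_similar` on the same tokens with two different `weight` matrices can select different pairs to merge, which is the sense in which the merging policy no longer depends directly on the intermediate ViT features. In a real training setup the projection would be learned, which is why a differentiable stand-in such as `soft_merge_weights` is sketched alongside the hard merge.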


Code & Models

Repositories

movinghoon/dtem (PyTorch, official)

Videos

Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers (SlidesLive)

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques