Efficient Visual Transformer by Learnable Token Merging

Yancheng Wang; Yingzhen Yang

arXiv:2407.15219 · cs.CV · July 22, 2025

TL;DR

This paper introduces LTM-Transformer, a learnable token merging scheme that enhances the efficiency of visual transformers by reducing computational costs while maintaining or improving accuracy.

Contribution

The paper proposes a novel learnable token merging method for transformers, compatible with existing models, that reduces FLOPs and inference time without sacrificing accuracy.

Findings

01 LTM-Transformer reduces FLOPs and inference time.

02 LTM-Transformer maintains or improves prediction accuracy.

03 LTM-Transformer is effective across multiple popular visual transformer architectures.

Abstract

Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT, and Swin, with LTM-Transformer blocks, leading to LTM-Transformer networks…
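To make the idea of learnable token merging concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: the abstract does not say how the merging weights are computed, so this assumes a learnable logit matrix that is turned into row-wise softmax weights, with each merged token being a weighted average of the input tokens. The names `merge_tokens`, `merge_logits`, and `attention_flops` are hypothetical, and the token counts (196 merged to 49) are chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def merge_tokens(tokens, merge_logits):
    """Merge N tokens into P tokens with a soft, learnable assignment.

    tokens:       array of shape (N, D), the input token embeddings.
    merge_logits: array of shape (P, N), assumed here to be a trainable
                  parameter (hypothetical; the paper may compute it differently).
    Returns an array of shape (P, D).
    """
    # Each row of weights sums to 1, so every merged token is a
    # convex combination of the input tokens.
    weights = softmax(merge_logits, axis=-1)
    return weights @ tokens


def attention_flops(num_tokens, dim):
    """Rough multiply-add count for Q K^T plus attention-times-V only.

    Ignores the linear projections and the softmax; it is meant to show
    the quadratic dependence on the number of tokens.
    """
    return 2 * num_tokens * num_tokens * dim


rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))        # e.g. a 14 x 14 patch grid, D = 64
merge_logits = rng.normal(size=(49, 196))  # stand-in for learned parameters
merged = merge_tokens(tokens, merge_logits)

print(merged.shape)
print(attention_flops(196, 64) / attention_flops(49, 64))
```

Under these assumptions, reducing the token count by a factor of 4 cuts the token-to-token part of self-attention by a factor of 16, which is the kind of saving that token merging targets. The reductions reported in the paper depend on the architecture and on where merging is applied.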

Code & Models

Repositories

statistical-deep-learning/ltm (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Image and Video Stabilization · Image Processing Techniques and Applications

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections