Efficient Visual Transformer by Learnable Token Merging
Yancheng Wang, Yingzhen Yang

TL;DR
This paper introduces LTM-Transformer, a transformer block with learnable token merging that lowers the FLOPs and inference time of visual transformers while maintaining or improving prediction accuracy.
Contribution
The paper proposes a learnable token merging block that can replace the transformer blocks in existing visual transformers, reducing FLOPs and inference time without sacrificing accuracy.
Findings
LTM-Transformer reduces the FLOPs and inference time of visual transformers.
It maintains or even improves prediction accuracy.
It is evaluated by replacing all transformer blocks in MobileViT, EfficientViT, ViT, and Swin.
Abstract
Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT, and Swin, with LTM-Transformer blocks, leading to LTM-Transformer networks…
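To make the idea of token merging concrete, here is a minimal sketch in plain Python. It is not the paper's method: the abstract does not say how the merging weights are defined or trained, so the sketch assumes a generic soft merge in which each of M output tokens is a softmax-weighted average of the N input tokens. The names `merge_tokens`, `merge_logits`, and `attention_matmul_macs` are hypothetical, and `merge_logits` is randomly initialized here where a real model would learn it. The last lines illustrate why fewer tokens help: the two matrix products in self-attention scale quadratically with the token count.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def merge_tokens(tokens, merge_logits):
    """Merge N tokens into M tokens (illustrative assumption, not the paper's scheme).

    tokens:       N x D list of token features
    merge_logits: M x N list standing in for learnable parameters
    Returns an M x D list; each row is a convex combination of input tokens.
    """
    weights = [softmax(row) for row in merge_logits]
    dim = len(tokens[0])
    return [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]


def attention_matmul_macs(num_tokens, dim):
    """Multiply-accumulates for Q K^T plus the attention-weighted sum of V."""
    return 2 * num_tokens * num_tokens * dim


random.seed(0)
N, M, D = 16, 8, 4  # toy sizes chosen for illustration
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
merge_logits = [[random.gauss(0, 1) for _ in range(N)] for _ in range(M)]

merged = merge_tokens(tokens, merge_logits)
ratio = attention_matmul_macs(M, D) / attention_matmul_macs(N, D)
print(len(merged), len(merged[0]), ratio)  # prints: 8 4 0.25
```

Halving the token count cuts the attention matrix products to a quarter of their original cost in this toy setting. The savings reported for real networks depend on details the abstract does not give, such as how many tokens each block keeps and the cost of the other layers.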
Taxonomy
Topics: Advanced Vision and Imaging · Image and Video Stabilization · Image Processing Techniques and Applications
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections
