EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan

TL;DR
EfficientViT is a family of memory-efficient vision transformers built on cascaded group attention, achieving a better speed-accuracy trade-off for real-time applications.
Contribution
The paper proposes a memory-efficient transformer architecture with cascaded group attention, which reduces computational redundancy across attention heads and improves speed without sacrificing accuracy.
Findings
Outperforms existing efficient models in speed and accuracy
EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy
EfficientViT-M5 also achieves 40.4% and 45.2% higher throughput than MobileNetV3-Large on GPU and CPU, respectively
Abstract
Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module feeding attention heads…
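The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The sandwich block places one attention layer between FFN layers, as the abstract states. The abstract excerpt is cut off before it finishes describing cascaded group attention, so the sketch assumes, following the published paper, that each head receives a different channel split of the input and that each head's output is added to the next head's input. All names, shapes, and random weights are hypothetical, and normalization and convolutional token interaction are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, wq, wk, wv, wo):
    """x: (tokens, dim); wq, wk, wv: one projection matrix per head."""
    num_heads = len(wq)
    splits = np.split(x, num_heads, axis=-1)  # assumed: one channel split per head
    carry = np.zeros_like(splits[0])
    outputs = []
    for h in range(num_heads):
        # Assumed cascade: the previous head's output refines this head's input.
        head_in = splits[h] + carry
        q, k, v = head_in @ wq[h], head_in @ wk[h], head_in @ wv[h]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        carry = attn @ v  # shape (tokens, dim // num_heads)
        outputs.append(carry)
    return np.concatenate(outputs, axis=-1) @ wo

def ffn(x, w1, w2):
    # Residual two-layer MLP with ReLU (a simplification).
    return x + np.maximum(x @ w1, 0.0) @ w2

def sandwich_block(x, pre_ffns, attn_params, post_ffns):
    # Sandwich layout: FFN layers, a single attention layer, FFN layers.
    for w1, w2 in pre_ffns:
        x = ffn(x, w1, w2)
    x = x + cascaded_group_attention(x, *attn_params)
    for w1, w2 in post_ffns:
        x = ffn(x, w1, w2)
    return x

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
tokens, dim, heads = 16, 32, 4
d_h = dim // heads

def rand(*shape):
    return rng.standard_normal(shape) * 0.1

attn_params = (
    [rand(d_h, 4) for _ in range(heads)],    # query projections
    [rand(d_h, 4) for _ in range(heads)],    # key projections
    [rand(d_h, d_h) for _ in range(heads)],  # value projections keep the split width
    rand(dim, dim),                          # output projection
)
pre_ffns = [(rand(dim, 2 * dim), rand(2 * dim, dim))]
post_ffns = [(rand(dim, 2 * dim), rand(2 * dim, dim))]

x = rand(tokens, dim)
y = sandwich_block(x, pre_ffns, attn_params, post_ffns)
print(y.shape)  # (16, 32)
```

Because each head projects only its own channel split, the query, key, and value matrices are smaller than in standard multi-head attention, where every head projects the full feature.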
Code & Models
- 🤗 timm/efficientvit_m0.r224_in1k · model · 646 downloads
- 🤗 timm/efficientvit_m1.r224_in1k · model · 208 downloads
- 🤗 timm/efficientvit_m2.r224_in1k · model · 444 downloads
- 🤗 timm/efficientvit_m3.r224_in1k · model · 869 downloads
- 🤗 timm/efficientvit_m4.r224_in1k · model · 222 downloads
- 🤗 timm/efficientvit_m5.r224_in1k · model · 1.6k downloads
Taxonomy
Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Advanced Neural Network Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
