EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Xinyu Liu; Houwen Peng; Ningxin Zheng; Yuqing Yang; Han Hu; Yixuan Yuan

arXiv:2305.07027 · cs.CV · May 12, 2023 · 36 cites

TL;DR

EfficientViT is a family of memory-efficient vision transformers built around cascaded group attention, achieving a better speed-accuracy trade-off for real-time applications.

Contribution

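The paper proposes a memory-efficient transformer building block with a sandwich layout (a single MHSA layer between efficient FFN layers) and a cascaded group attention module that reduces redundancy across attention heads, improving speed without sacrificing accuracy.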

Findings

01

Outperforms existing efficient models in speed and accuracy

02

EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy

03

EfficientViT-M5 achieves 40.4% and 45.2% higher throughput than MobileNetV3-Large on GPU and CPU, respectively

Abstract

Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module feeding attention heads…
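
The cascaded group attention idea named in the abstract can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it assumes each head receives its own channel split of the input, and that each head's output is added to the next head's split before attention is computed, with the head outputs concatenated and projected at the end. The helper names (`matmul`, `softmax_rows`, `random_matrix`, `cascaded_group_attention`) are hypothetical, and random matrices stand in for learned projection weights.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (assumption: random values)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def cascaded_group_attention(x, heads, rng):
    """x is a tokens-by-channels matrix; returns a matrix of the same shape."""
    n_tokens, channels = len(x), len(x[0])
    assert channels % heads == 0, "channels must divide evenly across heads"
    d = channels // heads
    carry = None      # output of the previous head, fed forward in the cascade
    outputs = []
    for h in range(heads):
        # Each head sees only its own slice of the channels.
        split = [row[h * d:(h + 1) * d] for row in x]
        if carry is not None:
            # Cascade step: add the previous head's output to this head's input.
            split = [[a + b for a, b in zip(r, c)] for r, c in zip(split, carry)]
        wq, wk, wv = (random_matrix(d, d, rng) for _ in range(3))
        q, k, v = matmul(split, wq), matmul(split, wk), matmul(split, wv)
        k_t = [list(col) for col in zip(*k)]
        scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
        carry = matmul(softmax_rows(scores), v)
        outputs.append(carry)
    # Concatenate head outputs along the channel axis, then project.
    concat = [sum((o[t] for o in outputs), []) for t in range(n_tokens)]
    return matmul(concat, random_matrix(channels, channels, rng))


rng = random.Random(0)
tokens = random_matrix(4, 8, rng)   # 4 tokens, 8 channels
out = cascaded_group_attention(tokens, heads=2, rng=rng)
print(len(out), len(out[0]))
```

Because each head projects only a slice of width `channels / heads` instead of the full feature, the per-head query, key, and value projections are smaller than in standard MHSA, which is one plausible reading of how the design saves computation. A real model would also use trained weights and whatever extra components the paper specifies, none of which are modeled here.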

Taxonomy

Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Advanced Neural Network Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings