Rethinking Vision Transformers for MobileNet Size and Speed
Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, Jian Ren

TL;DR
This paper introduces EfficientFormerV2, a vision transformer architecture optimized for mobile devices that achieves higher accuracy than MobileNetV2 while maintaining similar size and latency, challenging the notion that transformers are inherently less efficient.
Contribution
It proposes a novel supernet and a fine-grained joint search strategy to design efficient transformer architectures suitable for resource-constrained environments.
Findings
EfficientFormerV2 outperforms MobileNetV2 by 3.5% top-1 accuracy on ImageNet-1K.
The models do so with latency and parameter counts similar to MobileNetV2's.
Properly designed transformers can match MobileNet-level efficiency and performance.
Abstract
With the success of Vision Transformers (ViTs) in computer vision tasks, recent works try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches have been proposed to accelerate the attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, even the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constrained hardware. In this work, we investigate a central question: can transformer models run as fast as MobileNet and maintain a similar size? We revisit the design choices of ViTs and propose a novel supernet with low latency and high parameter efficiency. We further introduce a novel…
Code & Models
- rwightman/pytorch-image-models (PyTorch, official)
- MindSpore-paper-code-2/code3/tree/main/ssd_mobilenetV2_FPNlite (MindSpore)
- MindSpore-paper-code-3/code9/tree/main/MobileNet (MindSpore)
- 2023-MindSpore-4/Code11/tree/main/ssd_mobilenetV2_FPNlite (MindSpore)
- leondgarse/keras_cv_attention_models/tree/main/keras_cv_attention_models/efficientformer (TensorFlow)
- 🤗 timm/efficientformerv2_l.snap_dist_in1k (model · 449 dl · ♡ 1)
- 🤗 timm/efficientformerv2_s0.snap_dist_in1k (model · 4.2k dl · ♡ 1)
- 🤗 timm/efficientformerv2_s1.snap_dist_in1k (model · 956 dl · ♡ 1)
- 🤗 timm/efficientformerv2_s2.snap_dist_in1k (model · 2.8k dl · ♡ 2)
- 🤗 qualcomm/EfficientFormer (model · 49 dl · ♡ 1)
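The timm checkpoints listed above can most likely be loaded through timm's Hugging Face Hub integration. The following is a minimal sketch, assuming `timm` (with its PyTorch dependency) is installed and the hub IDs above are still current. The helper names `hub_model_id` and `load_checkpoint` are hypothetical and introduced only for illustration.

```python
# Variants listed above on the Hugging Face Hub under the timm organization.
VARIANTS = ("s0", "s1", "s2", "l")


def hub_model_id(variant: str) -> str:
    """Build the hub ID for one EfficientFormerV2 variant (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"timm/efficientformerv2_{variant}.snap_dist_in1k"


def load_checkpoint(variant: str):
    """Download pretrained weights and return the model in eval mode.

    Assumes timm is installed; it is imported lazily so the ID helper above
    works without it. Needs network access the first time it is called.
    """
    import timm

    # The "hf_hub:" prefix asks timm to resolve the model from the Hub.
    model = timm.create_model(f"hf_hub:{hub_model_id(variant)}", pretrained=True)
    return model.eval()
```

Calling `load_checkpoint("s0")` would fetch the smallest listed variant. Preprocessing (input resolution, normalization) should be taken from the checkpoint's own configuration rather than assumed.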
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors
Methods: Pointwise Convolution · Depthwise Convolution · Depthwise Separable Convolution · Batch Normalization · 1x1 Convolution · Convolution · Inverted Residual Block · Average Pooling
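
Several of the methods listed above (depthwise, pointwise, and depthwise separable convolution) are what keep MobileNet-style layers small. As a rough illustration of the parameter arithmetic, and not a description of the paper's own architecture, the sketch below compares a standard convolution with its depthwise separable counterpart. The channel sizes are arbitrary example values and biases are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1x1 conv that mixes channels
    return depthwise + pointwise


# Arbitrary example sizes: 64 input channels, 128 output channels, 3x3 kernel.
standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)

print(standard)   # 73728
print(separable)  # 8768
print(f"reduction: {standard / separable:.1f}x")
```

The reduction grows with kernel size and output width, which is why the listed building blocks suit parameter-constrained mobile models.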
