EfficientFormer: Vision Transformers at MobileNet Speed
Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren

TL;DR
EfficientFormer introduces a pure transformer architecture optimized for mobile devices. Its speed and accuracy are comparable or superior to those of lightweight CNNs such as MobileNetV2, which makes real-time vision applications practical.
Contribution
The paper presents a new transformer design, EfficientFormer, that matches MobileNet speed on mobile hardware while maintaining high accuracy, through architecture analysis and latency-driven optimization.
Findings
EfficientFormer-L1 achieves 79.2% top-1 accuracy on ImageNet-1K with 1.6 ms inference latency on iPhone 12.
EfficientFormer-L7 achieves 83.3% top-1 accuracy with 7.0 ms latency.
The proposed models outperform existing ViT-based models in both inference speed and accuracy on mobile devices.
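To put the reported latencies in perspective, the short Python sketch below converts them into an approximate per-second throughput and checks them against a frame budget. It assumes batch size 1 and back-to-back inference with no pre- or post-processing overhead. The helper names `frames_per_second` and `fits_budget` and the 30 fps target are illustrative assumptions, not part of the paper.

```python
# Accuracy and latency figures are the ones quoted in the findings above.
# Everything else (function names, the 30 fps target) is a hypothetical
# illustration, not something taken from the paper.
REPORTED = {
    "EfficientFormer-L1": {"top1": 79.2, "latency_ms": 1.6},
    "EfficientFormer-L7": {"top1": 83.3, "latency_ms": 7.0},
}


def frames_per_second(latency_ms: float) -> float:
    """Convert per-image latency (ms) to images per second.

    Assumes batch size 1 and that images are processed one after another.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def fits_budget(latency_ms: float, target_fps: float = 30.0) -> bool:
    """Return True if one inference fits inside a single frame at target_fps."""
    return latency_ms <= 1000.0 / target_fps


for name, stats in REPORTED.items():
    fps = frames_per_second(stats["latency_ms"])
    ok = fits_budget(stats["latency_ms"])
    print(f"{name}: {stats['top1']:.1f}% top-1, "
          f"{stats['latency_ms']:.1f} ms, ~{fps:.0f} images/s, "
          f"fits 30 fps budget: {ok}")
```

Under these assumptions both models leave most of a 33 ms frame budget free for the rest of an application pipeline. Real throughput depends on the device, the compiler, and thermal conditions.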
Abstract
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally several times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computational complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient…
Code & Models
- rwightman/pytorch-image-models (PyTorch, official)
- code-implementation1/Code9/tree/main/MobileNet (MindSpore)
- 2024-MindSpore-1/Code2/tree/main/model-1/efficientformer (MindSpore)
- leondgarse/keras_cv_attention_models/tree/main/keras_cv_attention_models/efficientformer (TensorFlow)
- 2023-MindSpore-1/ms-code-1/tree/main/MobileNet (MindSpore)
- 🤗 NimaBoscarino/efficientformer-l1-1000 (model) · 9 dl
- 🤗 NimaBoscarino/efficientformer-l1-300 (model) · 11 dl
- 🤗 NimaBoscarino/efficientformer-l3-300 (model) · 6 dl · ♡ 2
- 🤗 NimaBoscarino/efficientformer-l7-300 (model) · 12 dl
- 🤗 snap-research/efficientformer-l1-300 (model) · 814 dl · ♡ 4
- 🤗 snap-research/efficientformer-l7-300 (model) · 6 dl · ♡ 1
- 🤗 snap-research/efficientformer-l3-300 (model) · 68 dl · ♡ 3
- 🤗 kadirnar/timm_model_list (model) · ♡ 1
- 🤗 timm/efficientformer_l1.snap_dist_in1k (model) · 3.8k dl · ♡ 2
- 🤗 timm/efficientformer_l3.snap_dist_in1k (model) · 346 dl · ♡ 1
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Visual Attention and Saliency Detection
Methods: PoolFormer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pointwise Convolution · Convolution · Batch Normalization · Depthwise Convolution · Depthwise Separable Convolution · 1x1 Convolution · Average Pooling · Inverted Residual Block
