EfficientFormer: Vision Transformers at MobileNet Speed

Yanyu Li; Geng Yuan; Yang Wen; Ju Hu; Georgios Evangelidis; Sergey Tulyakov; Yanzhi Wang; Jian Ren

arXiv:2206.01191·cs.CV·October 12, 2022·253 cites

TL;DR

EfficientFormer introduces a pure transformer architecture optimized for mobile devices, achieving comparable or superior speed and accuracy to lightweight CNNs like MobileNetV2, enabling real-time vision applications.

Contribution

The paper presents a new transformer design, EfficientFormer, that matches MobileNet speed on mobile hardware while maintaining high accuracy, through architecture analysis and latency-driven optimization.

Findings

1. EfficientFormer-L1 achieves 79.2% top-1 accuracy with 1.6 ms latency on iPhone 12.

2. EfficientFormer-L7 achieves 83.3% accuracy with 7.0 ms latency.

3. The proposed models outperform existing ViT-based models in speed and performance on mobile devices.
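To put the reported latencies in terms of frame rate, the short sketch below converts per-image latency into an upper bound on frames per second and checks it against a per-frame budget. The latency figures are the ones listed in the findings above; the 30 FPS target and the helper names `frames_per_second` and `fits_budget` are illustrative assumptions, not taken from the paper, and the bound ignores pre-processing, post-processing, and any other work done per frame.

```python
# Reported single-image latencies in milliseconds, from the findings above.
REPORTED_LATENCY_MS = {
    "EfficientFormer-L1": 1.6,
    "EfficientFormer-L7": 7.0,
}


def frames_per_second(latency_ms: float) -> float:
    """Upper bound on throughput if each frame costs only latency_ms."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def fits_budget(latency_ms: float, target_fps: float = 30.0) -> bool:
    """True if one inference fits in the per-frame time budget.

    target_fps is an assumed real-time target, not a figure from the paper.
    """
    return latency_ms <= 1000.0 / target_fps


for name, ms in REPORTED_LATENCY_MS.items():
    print(
        f"{name}: about {frames_per_second(ms):.0f} FPS upper bound, "
        f"fits 30 FPS budget: {fits_budget(ms)}"
    )
```

Under these assumptions, 1.6 ms corresponds to at most 625 inferences per second and 7.0 ms to roughly 143, so both reported latencies leave most of a 33.3 ms frame budget free for the rest of an application pipeline.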

Abstract

Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally several times slower than lightweight convolutional networks. Therefore, deploying ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computational complexity of ViT through network architecture search or hybrid designs with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient…

Code & Models

Videos

EfficientFormer: Vision Transformers at MobileNet Speed · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Visual Attention and Saliency Detection

Methods: PoolFormer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pointwise Convolution · Convolution · Batch Normalization · Depthwise Convolution · Depthwise Separable Convolution · 1x1 Convolution · Average Pooling · Inverted Residual Block