Rethinking Vision Transformers for MobileNet Size and Speed

Yanyu Li; Ju Hu; Yang Wen; Georgios Evangelidis; Kamyar Salahi; Yanzhi Wang; Sergey Tulyakov; Jian Ren

arXiv:2212.08059 · cs.CV · September 6, 2023 · 20 cites

TL;DR

This paper introduces EfficientFormerV2, a vision transformer architecture optimized for mobile devices that achieves higher accuracy than MobileNetV2 while maintaining similar size and latency, challenging the notion that transformers are inherently less efficient.

Contribution

It proposes a novel supernet and a fine-grained joint search strategy to design efficient transformer architectures suitable for resource-constrained environments.

Findings

01 EfficientFormerV2 outperforms MobileNetV2 by 3.5% top-1 accuracy on ImageNet-1K.

02 The models achieve latency and parameter counts similar to those of MobileNetV2.

03 Properly designed transformers can match MobileNet-level efficiency and performance.

Abstract

With the success of Vision Transformers (ViTs) in computer vision tasks, recent work tries to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches have been proposed to accelerate the attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, even the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constrained hardware. In this work, we investigate a central question: can transformer models run as fast as MobileNet and maintain a similar size? We revisit the design choices of ViTs and propose a novel supernet with low latency and high parameter efficiency. We further introduce a novel…

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors

Methods: Pointwise Convolution · Depthwise Convolution · Depthwise Separable Convolution · Batch Normalization · 1x1 Convolution · Convolution · Inverted Residual Block · Average Pooling
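
Several of the methods listed above (depthwise, pointwise, and depthwise separable convolution) are the standard building blocks behind MobileNet-sized models. As a rough illustration of why they keep parameter counts low, the sketch below compares the weight count of a standard convolution with its depthwise separable factorization. The channel sizes and kernel size are hypothetical values chosen for illustration, not figures from the paper, and biases are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
c_in, c_out, k = 64, 128, 3

standard = conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard  # equals 1/c_out + 1/k**2

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"ratio:     {ratio:.4f}")
```

For a 3x3 kernel the ratio approaches 1/9 as the output channel count grows, which is the usual back-of-the-envelope saving quoted for this factorization.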