LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Ben Graham; Alaaeldin El-Nouby; Hugo Touvron; Pierre Stock; Armand Joulin; Hervé Jégou; Matthijs Douze

arXiv:2104.01136 · cs.CV · May 7, 2021 · 90 cites

TL;DR

LeViT is a hybrid vision transformer architecture optimized for fast inference. It combines convolutional design principles with attention to improve the speed/accuracy trade-off across several hardware platforms.

Contribution

The paper introduces LeViT, a novel hybrid neural network architecture that integrates convolutional design principles with transformers for efficient image classification.

Findings

01

LeViT outperforms existing convnets and vision transformers on the speed/accuracy trade-off.

02

At 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU.

03

Extensive experiments validate the effectiveness of the design choices.

Abstract

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most…
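
The abstract names the attention bias but does not spell out its form. As a rough illustration only, the sketch below adds a positional bias to the attention logits before the softmax, assuming one scalar per absolute (row, column) offset between two grid positions. The function name `attention_with_bias`, the `bias_table` layout, the single-head setup, and the 1/sqrt(d) scaling are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def attention_with_bias(q, k, v, bias_table, height, width):
    """Single-head attention over a height x width grid of tokens.

    q, k: (N, d) and v: (N, dv), with N = height * width.
    bias_table: (height, width) array; entry [dy, dx] is the bias added to
    the logit of any pair of tokens whose absolute offset is (dy, dx).
    The table layout is an assumption of this sketch.
    """
    n = height * width
    ys, xs = np.divmod(np.arange(n), width)      # grid coordinates of each token
    dy = np.abs(ys[:, None] - ys[None, :])       # (N, N) absolute row offsets
    dx = np.abs(xs[:, None] - xs[None, :])       # (N, N) absolute column offsets
    bias = bias_table[dy, dx]                    # (N, N) bias looked up per pair

    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage on a 2 x 3 grid with random features and a random bias table.
rng = np.random.default_rng(0)
h, w, d = 2, 3, 4
q, k, v = (rng.normal(size=(h * w, d)) for _ in range(3))
out, attn = attention_with_bias(q, k, v, rng.normal(size=(h, w)), h, w)
```

Because the bias depends only on the offset between two positions, the table holds height × width values regardless of how many token pairs there are, and the positional signal enters every attention layer rather than only the input embedding.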

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · ReLU6 · Multi-Head Attention · Hard Swish · Softmax · Layer Normalization · LeViT Attention Block · LeViT