LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze

TL;DR
LeViT is a hybrid vision transformer architecture optimized for fast inference, combining convolutional principles with attention mechanisms to achieve superior speed and accuracy trade-offs across various hardware platforms.
Contribution
The paper introduces LeViT, a novel hybrid neural network architecture that integrates convolutional design principles with transformers for efficient image classification.
Findings
LeViT outperforms existing convnets and vision transformers on the speed/accuracy trade-off.
At 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU.
Extensive experiments validate the effectiveness of the design choices.
Abstract
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most…
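The abstract names the attention bias without giving its form on this page. The sketch below is one plausible reading, not the authors' implementation: it assumes a single head whose learned scalar bias is shared by every query/key pair with the same absolute offset (|Δy|, |Δx|) on the token grid, and adds that bias to the scaled dot-product logits before the softmax. The function names, the toy 2×2 grid, and the zero-initialized bias values are all hypothetical.

```python
import math

def offset_index_table(height, width):
    """Map every (query, key) token pair to an id for its absolute offset."""
    positions = [(y, x) for y in range(height) for x in range(width)]
    offset_ids = {}  # (|dy|, |dx|) -> index into the bias parameter vector
    table = []
    for qy, qx in positions:
        row = []
        for ky, kx in positions:
            offset = (abs(qy - ky), abs(qx - kx))
            if offset not in offset_ids:
                offset_ids[offset] = len(offset_ids)
            row.append(offset_ids[offset])
        table.append(row)
    return table, len(offset_ids)

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(queries, keys, bias_params, table):
    """Attention weights for one head: softmax(scale * q.k + bias[offset])."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights = []
    for i, q in enumerate(queries):
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) + bias_params[table[i][j]]
            for j, k in enumerate(keys)
        ]
        weights.append(softmax(logits))
    return weights

# Toy example: a 2x2 token grid with 2-dimensional queries and keys.
table, num_offsets = offset_index_table(2, 2)
bias_params = [0.0] * num_offsets  # would be learned during training (assumption)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weights = biased_attention_weights(tokens, tokens, bias_params, table)
```

Under this assumption the positional information enters through the bias lookup alone, so no separate positional embedding has to be added to the tokens, and the number of bias parameters per head equals the number of distinct offsets on the grid.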
Code & Models
- 🤗 facebook/levit-384 · model · 13 dl
- 🤗 facebook/levit-256 · model · 208 dl
- 🤗 facebook/levit-192 · model · 177 dl
- 🤗 facebook/levit-128 · model · 17 dl
- 🤗 facebook/levit-128S · model · 3.1k dl · ♡ 4
- 🤗 kadirnar/timm_model_list · model · ♡ 1
- 🤗 timm/levit_128.fb_dist_in1k · model · 1.9k dl · ♡ 1
- 🤗 timm/levit_128s.fb_dist_in1k · model · 3.1k dl · ♡ 2
- 🤗 timm/levit_192.fb_dist_in1k · model · 227 dl
- 🤗 timm/levit_256.fb_dist_in1k · model · 24k dl · ♡ 2
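As a rough usage sketch for the timm checkpoints listed above, the snippet below loads one by name and runs a single forward pass on a random tensor. It assumes the `torch` and `timm` packages are installed, that the pretrained tag `levit_128s.fb_dist_in1k` is registered in the installed timm version, and that the model takes 224×224 RGB input. The helper name `classify_dummy_image` is hypothetical.

```python
def classify_dummy_image(model_name="levit_128s.fb_dist_in1k"):
    """Run one forward pass on a random image and return the logits' shape."""
    # Imported lazily so that defining this helper needs neither package.
    import torch
    import timm

    # pretrained=True downloads the weights on first use (network required).
    model = timm.create_model(model_name, pretrained=True)
    model.eval()

    dummy = torch.randn(1, 3, 224, 224)  # assumed input size: one 224x224 RGB image
    with torch.no_grad():
        logits = model(dummy)
    return tuple(logits.shape)
```

Calling `classify_dummy_image()` should return a shape with one row per image and one column per ImageNet-1k class. For real images, the preprocessing (resize, crop, normalization) should come from the checkpoint's own data configuration instead of a random tensor.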
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · ReLU6 · Multi-Head Attention · Hard Swish · Softmax · Layer Normalization · LeViT Attention Block · LeViT
