FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization
Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan

TL;DR
FastViT is a hybrid vision transformer that improves the latency-accuracy trade-off through structural reparameterization, train-time overparameterization, and large-kernel convolutions, outperforming existing models on various tasks and devices.
Contribution
Introduces FastViT, a hybrid transformer architecture built on RepMixer, a token mixing operator that uses structural reparameterization to remove skip connections and lower memory access cost, achieving a state-of-the-art latency-accuracy trade-off.
Findings
3.5x faster than CMT on a mobile device at the same ImageNet accuracy
4.2% better Top-1 accuracy than MobileOne at similar latency
Outperforms competitors in classification, detection, segmentation, and 3D tasks
Abstract
The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At…
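The abstract's central idea, removing a skip connection through structural reparameterization, can be illustrated with a small sketch. This is not the authors' RepMixer implementation: it assumes a single channel, a stride-1 depthwise convolution with zero "same" padding, and no normalization layers (folding those is a separate step that practical implementations typically also need). The function names below are hypothetical. The sketch shows that a train-time block computing x + conv(x) equals a single inference-time convolution whose kernel has 1 added at its center.

```python
import random


def depthwise_conv2d(x, kernel):
    """Zero-padded 'same' cross-correlation of one channel with an odd square kernel."""
    k = len(kernel)
    pad = k // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:  # outside the image counts as zero
                        acc += kernel[di][dj] * x[ii][jj]
            out[i][j] = acc
    return out


def train_time_block(x, kernel):
    """Two-branch form: skip connection plus depthwise convolution."""
    y = depthwise_conv2d(x, kernel)
    return [[xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def fold_skip_into_kernel(kernel):
    """Return a new kernel with the identity (skip) branch merged in.

    The skip connection is equivalent to a convolution whose kernel is 1 at
    the center and 0 elsewhere, so merging it means adding 1 to the center tap.
    """
    merged = [row[:] for row in kernel]
    c = len(kernel) // 2
    merged[c][c] += 1.0
    return merged


random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(5)]
kernel = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

two_branch = train_time_block(x, kernel)
single_conv = depthwise_conv2d(x, fold_skip_into_kernel(kernel))

# Largest element-wise gap between the two forms (floating-point noise only).
max_diff = max(
    abs(a - b) for ra, rb in zip(two_branch, single_conv) for a, b in zip(ra, rb)
)
print(max_diff < 1e-9)
```

The single-convolution form reads the input once instead of keeping it around for the addition, which is the kind of memory access saving the abstract attributes to removing skip connections.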
Code & Models
- 🤗 timm/fastvit_ma36.apple_in1k · model · 1.6k dl · ♡ 1
- 🤗 timm/fastvit_s12.apple_in1k · model · 2.4k dl
- 🤗 timm/fastvit_sa12.apple_in1k · model · 4.8k dl · ♡ 1
- 🤗 timm/fastvit_sa24.apple_in1k · model · 1.5k dl · ♡ 1
- 🤗 timm/fastvit_sa36.apple_in1k · model · 1.7k dl · ♡ 1
- 🤗 timm/fastvit_t8.apple_in1k · model · 24k dl · ♡ 2
- 🤗 timm/fastvit_t12.apple_in1k · model · 9.0k dl
- 🤗 timm/fastvit_ma36.apple_dist_in1k · model · 62 dl · ♡ 1
- 🤗 timm/fastvit_s12.apple_dist_in1k · model · 454 dl · ♡ 2
- 🤗 timm/fastvit_sa12.apple_dist_in1k · model · 1.4k dl · ♡ 1
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · ConvNeXt · Batch Normalization · Pointwise Convolution · Depthwise Convolution · Depthwise Separable Convolution · Linear Layer · Inverted Residual Block
