FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Pavan Kumar Anasosalu Vasu; James Gabriel; Jeff Zhu; Oncel Tuzel; Anurag Ranjan

arXiv:2303.14189·cs.CV·August 21, 2023·45 cites

TL;DR

FastViT is a hybrid vision transformer that improves the latency-accuracy trade-off through structural reparameterization, train-time overparameterization, and large-kernel convolutions, outperforming existing models across several tasks and devices.

Contribution

Introduces FastViT, a hybrid transformer architecture built on RepMixer, a token mixing operator that uses structural reparameterization to remove skip connections and lower memory access cost.

Findings

1. 3.5x faster than CMT on mobile

2. 4.2% better Top-1 accuracy than MobileOne at similar latency

3. Outperforms competitors in classification, detection, segmentation, and 3D tasks

Abstract

The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At…
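The idea of removing a skip connection by structural reparameterization can be illustrated with a small sketch. It assumes, as an illustration only, that the train-time block computes x + BN(DWConv(x)) for one channel of a 1D signal, with an odd kernel size, zero "same" padding, and batch normalization in inference mode; the abstract does not spell out these details, and all function names here are hypothetical. Under those assumptions the skip connection and the normalization fold into a single convolution kernel and bias.

```python
import math

def depthwise_conv1d(x, kernel, bias=0.0):
    """Zero-padded 'same' 1D cross-correlation for a single channel."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k)) + bias
        for i in range(len(x))
    ]

def batchnorm_eval(y, gamma, beta, mean, var, eps=1e-5):
    """Batch normalization with fixed (inference-time) statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in y]

def train_time_mixer(x, kernel, bn):
    """Assumed train-time form: skip connection plus BN(DWConv(x))."""
    normed = batchnorm_eval(depthwise_conv1d(x, kernel), **bn)
    return [xi + ni for xi, ni in zip(x, normed)]

def reparameterize(kernel, bn, eps=1e-5):
    """Fold BN and the skip connection into one kernel and bias."""
    scale = bn["gamma"] / math.sqrt(bn["var"] + eps)
    fused_kernel = [scale * w for w in kernel]
    # The skip connection equals a convolution with a centered unit impulse.
    fused_kernel[len(kernel) // 2] += 1.0
    fused_bias = bn["beta"] - scale * bn["mean"]
    return fused_kernel, fused_bias

if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0, 4.0]
    kernel = [0.2, -0.5, 0.1]
    bn = {"gamma": 1.5, "beta": 0.3, "mean": 0.1, "var": 0.8}

    reference = train_time_mixer(x, kernel, bn)
    fused_kernel, fused_bias = reparameterize(kernel, bn)
    fused = depthwise_conv1d(x, fused_kernel, fused_bias)

    # The two outputs should agree up to floating-point rounding.
    print(max(abs(a - b) for a, b in zip(reference, fused)))
```

After folding, inference needs a single convolution and no separate residual addition, which is the kind of saving in memory access that the abstract attributes to removing skip connections.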

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · ConvNeXt · Batch Normalization · Pointwise Convolution · Depthwise Convolution · Depthwise Separable Convolution · Linear Layer · Inverted Residual Block