Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios
Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan

TL;DR
Next-ViT is a vision Transformer architecture optimized for fast, accurate inference in industrial deployment settings such as TensorRT and CoreML, outperforming CNNs and existing ViTs on the latency/accuracy trade-off.
Contribution
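The paper proposes Next-ViT, built from two deployment-friendly blocks (the Next Convolution Block, NCB, and the Next Transformer Block, NTB) combined through a hybrid stacking strategy, achieving a better latency/accuracy balance than CNNs and prior ViTs on industrial vision tasks.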
Findings
Outperforms CNNs and ViTs in latency/accuracy trade-off
Surpasses ResNet (on TensorRT) and EfficientFormer (on CoreML) on detection and segmentation at similar latency
Runs 3.6x faster than CSWin at comparable accuracy
Abstract
Due to complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet their overall performance is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and…
Code & Models
- 🤗 timm/nextvit_base.bd_in1k · model · 202 dl
- 🤗 timm/nextvit_base.bd_in1k_384 · model · 36 dl
- 🤗 timm/nextvit_base.bd_ssld_6m_in1k · model · 171 dl · ♡ 1
- 🤗 timm/nextvit_base.bd_ssld_6m_in1k_384 · model · 82 dl · ♡ 3
- 🤗 timm/nextvit_large.bd_in1k · model · 87 dl
- 🤗 timm/nextvit_large.bd_in1k_384 · model · 34 dl
- 🤗 timm/nextvit_large.bd_ssld_6m_in1k · model · 36 dl
- 🤗 timm/nextvit_large.bd_ssld_6m_in1k_384 · model · 35 dl · ♡ 1
- 🤗 timm/nextvit_small.bd_in1k · model · 82 dl
- 🤗 timm/nextvit_small.bd_in1k_384 · model · 37 dl
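The checkpoints listed above are hosted under the timm organisation on the Hugging Face Hub, and their names follow timm's `architecture.pretrained_tag` pattern. The following is a minimal sketch of loading one for inference, assuming `timm` and PyTorch are installed and that the installed timm release includes the Next-ViT architectures. The helper names `parse_hub_id` and `load_checkpoint` are hypothetical and introduced only for illustration; the only library call used is `timm.create_model`.

```python
from typing import NamedTuple


class TimmModelId(NamedTuple):
    architecture: str    # e.g. "nextvit_small"
    pretrained_tag: str  # e.g. "bd_in1k"; empty string if the id has no tag


def parse_hub_id(hub_id: str) -> TimmModelId:
    """Split a Hub id such as 'timm/nextvit_small.bd_in1k' into its parts.

    Hypothetical helper, not part of timm.
    """
    name = hub_id.split("/", 1)[-1]  # drop the 'timm/' organisation prefix if present
    architecture, _, tag = name.partition(".")
    return TimmModelId(architecture, tag)


def load_checkpoint(hub_id: str, pretrained: bool = True):
    """Build the model through timm (assumes `pip install timm torch`)."""
    import timm  # imported lazily so parse_hub_id works without timm installed

    model_id = parse_hub_id(hub_id)
    # Rebuild the 'architecture.tag' name that timm.create_model expects.
    name = model_id.architecture
    if model_id.pretrained_tag:
        name = f"{name}.{model_id.pretrained_tag}"
    model = timm.create_model(name, pretrained=pretrained)
    return model.eval()  # inference mode: dropout off, batch norm uses running statistics


if __name__ == "__main__":
    print(parse_hub_id("timm/nextvit_small.bd_in1k"))
```

Calling `load_checkpoint("timm/nextvit_small.bd_in1k")` would download the weights on first use; passing `pretrained=False` builds the architecture with random weights, which is enough for latency measurements. The `_384` suffix in some names suggests checkpoints intended for 384x384 inputs, though that reading is an assumption based on the naming alone.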
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam
