Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Jiashi Li; Xin Xia; Wei Li; Huixia Li; Xing Wang; Xuefeng Xiao; Rui Wang; Min Zheng; Xin Pan

arXiv:2207.05501·cs.CV·August 17, 2022·138 cites

TL;DR

Next-ViT introduces a new vision Transformer architecture optimized for fast, accurate inference in industrial deployment scenarios, outperforming CNNs and existing ViTs on the latency/accuracy trade-off.

Contribution

The paper proposes Next-ViT, which combines deployment-friendly blocks (the Next Convolution Block and the Next Transformer Block) with a hybrid strategy to achieve a better latency/accuracy balance in industrial vision tasks.

Findings

01 Outperforms CNNs and ViTs on the latency/accuracy trade-off

02 Surpasses ResNet and EfficientFormer in industrial benchmarks

03 Achieves 3.6x faster inference than CSWin

Abstract

Due to their complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and…
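To make the "dominates ... from the perspective of latency/accuracy trade-off" claim concrete, the following is a minimal sketch of how dominance on a two-axis trade-off is commonly checked. It is not taken from the paper: the `Model` type, the `dominates` and `pareto_front` helpers, and all model names and numbers are hypothetical placeholders chosen only for illustration.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    latency_ms: float  # lower is better
    accuracy: float    # higher is better


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.accuracy >= b.accuracy
    strictly_better = a.latency_ms < b.latency_ms or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(models: list[Model]) -> list[Model]:
    """Return the models that no other model in the list dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Hypothetical placeholder numbers, NOT measurements from the paper.
candidates = [
    Model("model_a", latency_ms=8.0, accuracy=80.0),
    Model("model_b", latency_ms=8.0, accuracy=82.0),
    Model("model_c", latency_ms=12.0, accuracy=81.0),
    Model("model_d", latency_ms=20.0, accuracy=84.0),
]

front_names = [m.name for m in pareto_front(candidates)]
print(front_names)
```

Under this definition, a model family "dominates" a baseline when, for each baseline model, some member of the family is at least as fast and at least as accurate, and strictly better on one of the two. Measured latencies depend on the deployment backend, so a comparison of this kind is only meaningful when all models are timed on the same target.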

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam