SepViT: Separable Vision Transformer

Wei Li; Xing Wang; Xin Xia; Jie Wu; Jiashi Li; Xuefeng Xiao; Min Zheng; Shiping Wen

arXiv:2203.15380 · cs.CV · June 16, 2023 · 31 cites

TL;DR

SepViT introduces an efficient vision transformer that reduces computational costs by employing depthwise separable self-attention, achieving high accuracy and lower latency across multiple vision tasks.

Contribution

The paper proposes SepViT, a novel separable vision transformer architecture that incorporates depthwise separable self-attention for improved efficiency and performance.

Findings

01

Achieves 84.2% top-1 accuracy on ImageNet-1K classification while cutting latency by 40% relative to models of similar accuracy (e.g., CSWin).

02

Demonstrates state-of-the-art performance on ADE20K semantic segmentation and on COCO object detection and instance segmentation.

03

Reduces computational costs while maintaining high accuracy.

Abstract

Vision Transformers have witnessed prevailing success in a series of vision tasks. However, these Transformers often rely on extensive computational costs to achieve high performance, which is burdensome to deploy on resource-constrained devices. To alleviate this issue, we draw lessons from depthwise separable convolution and imitate its ideology to design an efficient Transformer backbone, i.e., Separable Vision Transformer, abbreviated as SepViT. SepViT helps to carry out the local-global information interaction within and among the windows in sequential order via a depthwise separable self-attention. The novel window token embedding and grouped self-attention are employed to compute the attention relationship among windows with negligible cost and establish long-range visual interactions across multiple windows, respectively. Extensive experiments on general-purpose vision…
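
The two-step attention described in the abstract (attention within each window, then attention among windows through per-window tokens) can be sketched roughly as follows. This is a minimal single-head NumPy illustration, not the authors' implementation. The function name `separable_self_attention`, the random projection matrices, and the zero-initialized window token are assumptions made for the example; normalization layers, multiple heads, and grouped self-attention are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    scale = q.shape[-1] ** -0.5
    return softmax(q @ np.swapaxes(k, -1, -2) * scale) @ v


def separable_self_attention(x, window, rng):
    """Hypothetical sketch. x: (H, W, C) feature map; window: window side length."""
    H, W, C = x.shape
    nh, nw = H // window, W // window
    n = nh * nw

    # Partition into non-overlapping windows: (n, window*window, C).
    wins = x.reshape(nh, window, nw, window, C).transpose(0, 2, 1, 3, 4)
    wins = wins.reshape(n, window * window, C)

    # Prepend one window token per window.
    # Assumption: zeros here; it would be a learnable vector in a trained model.
    token = np.zeros((n, 1, C))
    seq = np.concatenate([token, wins], axis=1)

    # Step 1 (depthwise): self-attention inside each window, token included.
    # Assumption: random projections stand in for learned weights.
    wq, wk, wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    seq = attention(seq @ wq, seq @ wk, seq @ wv)
    tokens, feats = seq[:, 0], seq[:, 1:]  # (n, C) and (n, window*window, C)

    # Step 2 (pointwise): attention among windows, computed from the window
    # tokens only, then used to mix the per-window feature maps.
    pq, pk = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(2))
    attn = softmax((tokens @ pq) @ (tokens @ pk).T * C ** -0.5)  # (n, n)
    mixed = np.einsum("nm,mpc->npc", attn, feats)

    # Merge the windows back into an (H, W, C) feature map.
    out = mixed.reshape(nh, nw, window, window, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)


x = np.random.default_rng(0).standard_normal((8, 8, 16))
y = separable_self_attention(x, window=4, rng=np.random.default_rng(1))
print(y.shape)
```

In this sketch the window-to-window attention map has one row and one column per window rather than per pixel, which is the sense in which the cross-window step stays cheap relative to full global attention.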

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection

Methods: Attention Is All You Need · Linear Layer · Pointwise Convolution · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Depthwise Convolution · Label Smoothing