SepViT: Separable Vision Transformer
Wei Li, Xing Wang, Xin Xia, Jie Wu, Jiashi Li, Xuefeng Xiao, Min Zheng, Shiping Wen

TL;DR
SepViT introduces an efficient vision transformer that reduces computational costs by employing depthwise separable self-attention, achieving high accuracy and lower latency across multiple vision tasks.
Contribution
The paper proposes SepViT, a novel separable vision transformer architecture that incorporates depthwise separable self-attention for improved efficiency and performance.
Findings
Achieves 84.2% top-1 accuracy on ImageNet-1K while cutting latency by 40% relative to models of similar accuracy.
Demonstrates state-of-the-art performance on ADE20K semantic segmentation and on COCO object detection and instance segmentation.
Reduces computational costs while maintaining high accuracy.
Abstract
Vision Transformers have witnessed prevailing success in a series of vision tasks. However, these Transformers often rely on extensive computational costs to achieve high performance, which is burdensome to deploy on resource-constrained devices. To alleviate this issue, we draw lessons from depthwise separable convolution and imitate its ideology to design an efficient Transformer backbone, i.e., Separable Vision Transformer, abbreviated as SepViT. SepViT helps to carry out the local-global information interaction within and among the windows in sequential order via a depthwise separable self-attention. The novel window token embedding and grouped self-attention are employed to compute the attention relationship among windows with negligible cost and establish long-range visual interactions across multiple windows, respectively. Extensive experiments on general-purpose vision…
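To make the efficiency argument concrete, the sketch below gives a rough multiply-add count for the attention products only (QK^T and attention-times-V), comparing one global attention over all patches with a two-step scheme: attention inside each window, where every window carries one extra window token, followed by attention among the window tokens. This is a simplified illustration under those assumptions, not the paper's own complexity formula; the function names and the example sizes (56x56 feature map, 96 channels, 7x7 windows) are hypothetical, and projection layers are ignored.

```python
def global_attention_cost(height, width, channels):
    """Multiply-adds for QK^T and attention-times-V over all tokens at once."""
    tokens = height * width
    return 2 * tokens * tokens * channels


def separable_attention_cost(height, width, channels, window):
    """Rough cost of window-local attention plus attention among window tokens.

    Assumption: each window of window*window patches carries one extra
    window token, and a second attention step runs over those tokens only.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window + 1  # patches plus one window token
    within = 2 * num_windows * tokens_per_window ** 2 * channels
    among = 2 * num_windows ** 2 * channels
    return within + among


if __name__ == "__main__":
    # Hypothetical sizes chosen for illustration only.
    h, w, c, m = 56, 56, 96, 7
    full = global_attention_cost(h, w, c)
    sep = separable_attention_cost(h, w, c, m)
    print(f"global: {full:,}  separable sketch: {sep:,}  ratio: {full / sep:.1f}x")
```

Under this counting, the step among windows is small compared with the step within windows, which is in line with the abstract's description of the cross-window attention as having negligible cost.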
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection
Methods: Attention Is All You Need · Linear Layer · Pointwise Convolution · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Depthwise Convolution · Label Smoothing
