SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
Seokju Yun, Youngmin Ro

TL;DR
SHViT introduces a single-head attention mechanism and macro design optimizations to create a more memory-efficient vision transformer that achieves superior speed-accuracy tradeoffs on various devices.
Contribution
The paper proposes a novel single-head attention module and macro design strategies for efficient vision transformers, reducing redundancy and improving performance.
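For readers unfamiliar with the term, the following is a minimal sketch of generic single-head scaled dot-product attention in plain Python. It is not the paper's module, whose details are not given in this summary. The function names `softmax` and `single_head_attention` are hypothetical, and the learned query/key/value projections of a real layer are omitted for brevity. The point is only that a single head attends over the full channel dimension at once, whereas a multi-head layer splits the channels into groups and repeats this computation per group.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v):
    """Generic one-head scaled dot-product attention (illustrative only).

    q: list of n query vectors of length d
    k: list of m key vectors of length d
    v: list of m value vectors of length d_v
    Returns n output vectors of length d_v.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    d_v = len(v[0])
    out = []
    for qi in q:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d_v)])
    return out


# Assumption: identity projections, so the tokens serve as q, k, and v.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = single_head_attention(tokens, tokens, tokens)
print(len(result), len(result[0]))
```

Each output row is a convex combination of the value vectors, so with the toy tokens above every output coordinate stays within the range of the input values.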
Findings
SHViT-S4 is significantly faster than MobileViTv2 on multiple devices.
SHViT achieves higher accuracy with reduced computational cost.
SHViT achieves comparable object detection and segmentation performance at lower latency.
Abstract
Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that…
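The effect of the patchify stride on token count can be made concrete with a little arithmetic. The sketch below is illustrative only: the 224x224 input size, the stride of 16, the 3-stage layout, and the 2x downsampling between stages are assumptions chosen for the example rather than values stated in the excerpt above, and `tokens_per_stage` is a hypothetical helper.

```python
def tokens_per_stage(image_size: int, stem_stride: int, num_stages: int) -> list[int]:
    """Token count at each stage of a hierarchical backbone.

    Assumes a square input and a padded stride-2 downsampling between
    stages, so the side length is halved and rounded up each time.
    """
    counts = []
    side = image_size // stem_stride
    for _ in range(num_stages):
        counts.append(side * side)
        side = (side + 1) // 2  # ceil(side / 2)
    return counts


# Conventional macro design from the abstract: 4x4 patch embedding, 4 stages.
conventional = tokens_per_stage(224, stem_stride=4, num_stages=4)
# Assumed larger-stride variant: 16x16 patch embedding, 3 stages.
larger_stride = tokens_per_stage(224, stem_stride=16, num_stages=3)

print(conventional)   # [3136, 784, 196, 49]
print(larger_stride)  # [196, 49, 16]
```

Under these assumptions the first stage processes 196 tokens instead of 3136, which is where the reduction in early-stage memory access would come from.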
Code & Models
- 🤗 timm/shvit_s1.in1k · model · 120 dl
- 🤗 timm/shvit_s2.in1k · model · 48 dl
- 🤗 timm/shvit_s3.in1k · model · 272 dl
- 🤗 timm/shvit_s4.in1k · model · 91 dl
- 🤗 christo357/shvit_s2-cifar · model
- 🤗 christo357/shvit_s2-eurosat · model
- 🤗 christo357/shvit_s2-medmnist · model
- 🤗 christo357/shvit_s3-medmnist · model
- 🤗 christo357/shvit_s3-eurosat · model
- 🤗 christo357/shvit_s3-cifar · model
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Vision and Imaging
Methods: Attention Is All You Need · MobileViTv2 · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Layer Normalization · Multi-Head Attention · Adam · Softmax
