SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Seokju Yun; Youngmin Ro

arXiv:2401.16456 · cs.CV · March 29, 2024 · 6 cites

Open Access · 1 Repo · 10 Models

TL;DR

SHViT introduces a single-head attention mechanism and macro design optimizations to create a more memory-efficient vision transformer that achieves superior speed-accuracy tradeoffs on various devices.

Contribution

The paper proposes a novel single-head attention module and macro design strategies for efficient vision transformers, reducing redundancy and improving performance.
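As a point of reference for the term "single-head attention", the following is a minimal sketch of textbook scaled dot-product attention with one head, written in plain Python. It is not the paper's module: the learned projections, normalization, and any channel handling the authors use are omitted, and the function names `softmax` and `single_head_attention` are hypothetical. It only illustrates that a single head computes one attention map over all tokens instead of several maps over channel groups.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(queries, keys, values):
    """Generic one-head scaled dot-product attention (illustrative only).

    queries, keys: lists of token vectors with equal dimension d.
    values: list of token vectors, one per key.
    Returns one output vector per query.
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        # One similarity score per key; a single head means a single score map.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy input (assumed values): two identical tokens give uniform weights,
# so each output is the mean of the value vectors.
tokens = [[1.0, 0.0], [1.0, 0.0]]
vals = [[0.0, 2.0], [4.0, 6.0]]
result = single_head_attention(tokens, tokens, vals)
```

With the toy input above, both output vectors equal the mean of the two value vectors, which is a quick sanity check that the weights sum to one.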

Findings

01 SHViT-S4 is significantly faster than MobileViTv2 on multiple devices.

02 SHViT achieves higher accuracy with reduced computational cost.

03 SHViT delivers comparable object detection and segmentation performance with lower latency.

Abstract

Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4×4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with a multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using a larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that…
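The effect of a larger-stride patchify stem on token count can be checked with simple arithmetic. The sketch below assumes a 224×224 input and compares the conventional stride of 4 with a stride of 16; both the input size and the stride of 16 are illustrative assumptions, not values stated in the text above, and the helper names are hypothetical.

```python
def token_count(image_size: int, patch_stride: int) -> int:
    """Number of tokens produced by a non-overlapping patchify stem."""
    if image_size % patch_stride != 0:
        raise ValueError("image_size must be divisible by patch_stride")
    side = image_size // patch_stride
    return side * side


def attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by global self-attention (quadratic in tokens)."""
    return num_tokens * num_tokens


IMAGE_SIZE = 224  # assumed input resolution

tokens_stride4 = token_count(IMAGE_SIZE, 4)    # 56 * 56 = 3136 tokens
tokens_stride16 = token_count(IMAGE_SIZE, 16)  # 14 * 14 = 196 tokens

# Fewer tokens after the stem means fewer activations to read and write.
token_ratio = tokens_stride4 / tokens_stride16  # 16.0

# If global attention were applied at this resolution, the pairwise cost
# would shrink by the square of the token ratio.
pair_ratio = attention_pairs(tokens_stride4) / attention_pairs(tokens_stride16)  # 256.0
```

Under these assumptions the first stage handles 16 times fewer tokens, which is consistent with the abstract's point that a larger stride reduces memory access costs and spatial redundancy early in the network.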

Code & Models

Repositories

ysj9909/SHViT (PyTorch, official)

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Vision and Imaging

Methods: Attention Is All You Need · MobileViTv2 · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Layer Normalization · Multi-Head Attention · Adam · Softmax