TL;DR
This paper introduces Hierarchical Multi-Head Self-Attention (H-MHSA), a novel method that reduces computational complexity in vision transformers by modeling local and global relationships hierarchically, leading to efficient and powerful image understanding.
Contribution
The paper proposes H-MHSA, a hierarchical attention mechanism that significantly reduces computation while maintaining detailed local and global feature modeling in vision transformers.
Findings
HAT-Net achieves superior performance on vision tasks.
H-MHSA reduces computational load dramatically.
H-MHSA models both local and global dependencies effectively.
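To give a rough sense of where the savings come from, the sketch below counts pairwise attention scores for full self-attention versus a two-step local-then-global scheme. The function name `attention_pair_counts`, the 56×56 token grid, and the 8×8 window are illustrative assumptions, not values taken from the paper. The count also ignores projections, heads, and any extra stages the actual method may use.

```python
def attention_pair_counts(height, width, window):
    """Count query-key pairs for full vs. hierarchical attention.

    Assumes (hypothetically) a height x width grid of tokens split into
    non-overlapping window x window groups, each merged into one token
    for the global step.
    """
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")

    n_tokens = height * width
    tokens_per_window = window * window
    n_windows = n_tokens // tokens_per_window

    full = n_tokens ** 2                       # every token attends to every token
    local = n_windows * tokens_per_window ** 2 # attention inside each window only
    global_ = n_windows ** 2                   # attention among merged tokens only

    return {
        "full": full,
        "local": local,
        "global": global_,
        "hierarchical": local + global_,
    }


counts = attention_pair_counts(56, 56, 8)
print(counts)
# {'full': 9834496, 'local': 200704, 'global': 2401, 'hierarchical': 203105}
```

Under these assumed sizes, the hierarchical scheme evaluates roughly 48 times fewer attention pairs than full self-attention.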
Abstract
This paper tackles the high computational and space complexity of Multi-Head Self-Attention (MHSA) in vanilla vision transformers. To this end, we propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches, as is commonly done, and view each patch as a token. The proposed H-MHSA then learns token relationships within local patches, which serves as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since attention is only calculated for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can…
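The steps described in the abstract can be sketched in a few lines of NumPy. This is a simplified reading of the abstract, not the authors' implementation. It uses a single head with identity query/key/value projections, merges each window by mean pooling, and aggregates local and global features by addition. All of these choices, and the names `self_attention` and `hierarchical_attention`, are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Scaled dot-product self-attention over the last two axes (..., n, d).

    Assumption: identity Q/K/V projections and a single head, for brevity.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def hierarchical_attention(feature_map, window):
    """Local attention inside windows, then global attention over merged tokens.

    feature_map: array of shape (H, W, d), one token per spatial position.
    window: side length of each non-overlapping local window.
    """
    h, w, d = feature_map.shape
    if h % window or w % window:
        raise ValueError("H and W must be divisible by the window size")
    gh, gw = h // window, w // window

    # Step 1: group tokens into windows -> (gh, gw, window*window, d).
    windows = (
        feature_map.reshape(gh, window, gw, window, d)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh, gw, window * window, d)
    )
    local = self_attention(windows)

    # Step 2: merge each window into one token (assumed: mean pooling),
    # then attend among the few merged tokens.
    merged = local.mean(axis=2).reshape(gh * gw, d)
    global_tokens = self_attention(merged).reshape(gh, gw, 1, d)

    # Step 3: aggregate local and global features (assumed: addition).
    combined = local + global_tokens

    # Restore the original (H, W, d) layout.
    return (
        combined.reshape(gh, gw, window, window, d)
        .transpose(0, 2, 1, 3, 4)
        .reshape(h, w, d)
    )


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
y = hierarchical_attention(x, window=4)
print(y.shape)  # (8, 8, 16)
```

Each attention call here sees only `window * window` tokens (local step) or `gh * gw` tokens (global step), which is the "limited number of tokens at each step" the abstract refers to.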
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Dense Connections · Softmax · Vision Transformer
