Vision Transformers with Hierarchical Attention

Yun Liu; Yu-Huan Wu; Guolei Sun; Le Zhang; Ajad Chhatkuli; Luc Van Gool

arXiv:2106.03180 · cs.CV · March 27, 2024

TL;DR

This paper introduces Hierarchical Multi-Head Self-Attention (H-MHSA), which reduces the computational cost of self-attention in vision transformers by first modeling relationships among tokens within local patches and then modeling global dependencies among a smaller set of merged tokens.

Contribution

The paper proposes H-MHSA, a hierarchical attention mechanism that significantly reduces computation while maintaining detailed local and global feature modeling in vision transformers.

Findings

01. HAT-Net achieves superior performance on vision tasks.

02. H-MHSA reduces computational load dramatically.

03. H-MHSA effectively models both local and global dependencies.

Abstract

This paper tackles the high computational and space complexity of Multi-Head Self-Attention (MHSA) in vanilla vision transformers. To this end, we propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches, as is commonly done, and view each patch as a token. H-MHSA then learns token relationships within local patches, which serves as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since attention is calculated for only a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can…
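To make the cost argument in the abstract concrete, the sketch below counts how many query-key pairs are scored by plain self-attention versus a two-step local-then-global scheme. It is a rough illustration under stated assumptions, not the paper's implementation: the function names, the square non-overlapping windows, and the choice to merge each window into exactly one token for the global step are all hypothetical simplifications.

```python
def attention_pairs_full(height, width):
    """Query-key pairs scored by plain self-attention over all tokens."""
    n = height * width
    return n * n


def attention_pairs_hierarchical(height, width, window):
    """Query-key pairs scored by a local step followed by a global step.

    Assumption (illustration only): the token grid is split into square,
    non-overlapping windows of `window` x `window` tokens, and each window
    is merged into a single token before the global step.
    """
    if height % window or width % window:
        raise ValueError("token grid must divide evenly into windows")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    # Local step: every token attends only to tokens in its own window.
    local_pairs = num_windows * tokens_per_window ** 2
    # Global step: merged tokens attend to each other.
    global_pairs = num_windows ** 2
    return local_pairs + global_pairs


# Hypothetical example: a 56 x 56 token grid with 8 x 8 windows.
full = attention_pairs_full(56, 56)
hier = attention_pairs_hierarchical(56, 56, window=8)
print(f"full: {full}, hierarchical: {hier}, ratio: {full / hier:.1f}")
```

Under these assumptions, full attention grows quadratically with the number of tokens, while the local step grows only linearly for a fixed window size and the global step operates on far fewer tokens. The pair count ignores projection layers and the final aggregation of local and global features, so it is only a proxy for actual compute.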

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Dense Connections · Softmax · Vision Transformer