Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention

Tianxiao Zhang; Bo Luo; Guanghui Wang

arXiv:2410.14874·cs.CV·February 4, 2025

Open Access

TL;DR

This paper introduces Multi-Overlapped-Head Self-Attention (MOHSA), a novel approach that overlaps attention heads in Vision Transformers to improve their performance across multiple datasets.

Contribution

The paper proposes MOHSA, a head-overlapping technique for Vision Transformers in which each head's queries, keys, and values overlap with those of its two adjacent heads, and reports empirical improvements over standard multi-head self-attention (MHSA).

Findings

1. MOHSA outperforms standard MHSA on multiple benchmarks.

2. Overlapping heads with adjacent ones enhances feature learning.

3. Optimal overlapping ratios vary across models and datasets.

Abstract

Vision Transformers have made remarkable progress in recent years, achieving state-of-the-art performance in most vision tasks. A key component of this success is the Multi-Head Self-Attention (MHSA) module, which enables each head to learn different representations by applying the attention mechanism independently. In this paper, we empirically demonstrate that Vision Transformers can be further enhanced by overlapping the heads in MHSA. We introduce Multi-Overlapped-Head Self-Attention (MOHSA), where the queries, keys, and values of each head are overlapped with those of its two adjacent heads, while zero-padding is employed for the first and last heads, which have only one neighboring head. Various paradigms for overlapping ratios are proposed to fully investigate the optimal performance of our approach. The proposed approach is evaluated using five Transformer models on…
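
The head-overlapping step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`overlap_heads`, `overlapped_attention`), the single integer `overlap` width, and the choice to simply concatenate the widened head outputs are assumptions made for illustration. The abstract does not specify how the overlap ratio is scheduled or how the concatenated output is projected back to the model dimension, so both are left out here.

```python
import numpy as np


def overlap_heads(x, num_heads, overlap):
    """Split (tokens, dim) features into heads that also include `overlap`
    channels borrowed from each adjacent head.

    The channel axis is zero-padded on both sides, so the first and last
    heads (which have only one neighbor) receive zeros on their outer side.
    """
    tokens, dim = x.shape
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads
    assert 0 <= overlap <= head_dim, "overlap is limited to adjacent heads"

    # Zero-pad only the channel axis: (tokens, dim + 2 * overlap).
    padded = np.pad(x, ((0, 0), (overlap, overlap)))
    width = head_dim + 2 * overlap
    heads = [padded[:, i * head_dim: i * head_dim + width]
             for i in range(num_heads)]
    return np.stack(heads)  # (num_heads, tokens, head_dim + 2 * overlap)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def overlapped_attention(q, k, v, num_heads, overlap):
    """Scaled dot-product attention over overlapped heads.

    Assumption: the widened head outputs are concatenated as-is; a learned
    projection back to `dim` would normally follow and is omitted here.
    """
    qh = overlap_heads(q, num_heads, overlap)
    kh = overlap_heads(k, num_heads, overlap)
    vh = overlap_heads(v, num_heads, overlap)

    scale = 1.0 / np.sqrt(qh.shape[-1])
    attn = softmax(qh @ kh.transpose(0, 2, 1) * scale, axis=-1)
    out = attn @ vh  # (num_heads, tokens, head_dim + 2 * overlap)

    tokens = q.shape[0]
    return out.transpose(1, 0, 2).reshape(tokens, -1)
```

With `overlap=0` the split reduces to the standard, non-overlapping MHSA head partition. With `overlap > 0` each head is `2 * overlap` channels wider, so the concatenated output grows from `dim` to `num_heads * (head_dim + 2 * overlap)` channels.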

Taxonomy

Topics: Brain Tumor Detection and Classification · Neural Networks and Applications

Methods: Attention Is All You Need · Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Softmax · Multi-Head Attention · Dropout