Fusion of regional and sparse attention in Vision Transformers

Nabil Ibtehaz; Ning Yan; Masood Mortazavi; Daisuke Kihara

arXiv:2406.08859 · cs.CV · June 14, 2024 · 1 citation

Open Access

TL;DR

This paper introduces Atrous Attention, a hybrid attention mechanism that combines regional and sparse attention in Vision Transformers, and builds on it a new backbone, ACC-ViT, that improves on MaxViT in both accuracy and parameter count on ImageNet-1K.

Contribution

The paper proposes Atrous Attention to unify regional and sparse attention, and develops ACC-ViT, a hybrid transformer backbone that outperforms state-of-the-art models such as MaxViT in accuracy while using fewer parameters.

Findings

01. Achieves approximately 84% accuracy on ImageNet-1K.

02. Outperforms MaxViT by 0.42% in accuracy.

03. Uses 8.4% fewer parameters than MaxViT.

Abstract

Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions, in contrast to the global attention employed in the original ViT. Regional attention restricts pixel interactions within specific regions, while sparse attention disperses them across sparse grids. These differing approaches pose a challenge between maintaining hierarchical relationships vs. capturing a global context. In this study, drawing inspiration from atrous convolution, we propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information while preserving hierarchical structures. Based on this, we introduce a versatile, hybrid vision transformer backbone called ACC-ViT, tailored for standard vision tasks. Our compact model achieves approximately 84% accuracy on ImageNet-1K…
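The difference between the three token groupings described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `window_partition`, `grid_partition`, and `atrous_partition`, the `(H, W, C)` layout, and the interpretation of Atrous Attention as windowing over dilated sub-samplings of the feature map (by analogy with atrous convolution) are all assumptions made for illustration. Attention itself is omitted; in each case it would be computed independently within every returned group of tokens.

```python
import numpy as np

def window_partition(x, p):
    """Regional grouping: non-overlapping p x p windows of an (H, W, C) map.
    Returns an array of shape (num_groups, p*p, C)."""
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def grid_partition(x, g):
    """Sparse grouping: each group holds g x g tokens sampled on a uniform
    grid, so neighbours within a group are H//g rows (W//g columns) apart."""
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

def atrous_partition(x, p, d):
    """Assumed atrous grouping: take every d-th token (dilation d) at each of
    the d*d possible offsets, then form p x p windows on each sub-map."""
    groups = [window_partition(x[i::d, j::d], p)
              for i in range(d) for j in range(d)]
    return np.concatenate(groups, axis=0)

# Toy 8 x 8 single-channel map whose values are the flattened token indices.
x = np.arange(64).reshape(8, 8, 1)
regional = window_partition(x, 4)    # 4 groups of 16 adjacent tokens
sparse = grid_partition(x, 4)        # 4 groups of 16 tokens spaced 2 apart
dilated = atrous_partition(x, 2, 2)  # 16 groups of 4 tokens spaced 2 apart
```

Under these assumptions, a dilation of 1 reproduces the regional windows and a dilation of `H // p` reproduces the sparse grid, so intermediate dilations sit between the two extremes. How the paper fuses the outputs computed at different dilations is not covered by this sketch.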

Taxonomy

Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors · Ocular and Laser Science Research

Methods: Residual Connection · Softmax · Layer Normalization · Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer