ACC-ViT : Atrous Convolution's Comeback in Vision Transformers

Nabil Ibtehaz; Ning Yan; Masood Mortazavi; Daisuke Kihara

arXiv:2403.04200 · cs.CV · March 8, 2024 · 1 citation

Open Access

TL;DR

This paper introduces ACC-ViT, a hybrid vision transformer built on Atrous Attention, a mechanism inspired by atrous convolution that combines regional and sparse attention. The model achieves high accuracy with fewer parameters and is versatile across tasks.

Contribution

The work proposes Atrous Attention to fuse local and global information in vision transformers and redesigns convolution blocks with atrous convolution, creating a versatile hybrid backbone.

Findings

01. Achieves ~84% accuracy on ImageNet-1K with fewer parameters.

02. Outperforms the state-of-the-art MaxViT by 0.42% in accuracy.

03. Effective across varied tasks, including medical imaging, detection, and contrastive learning.

Abstract

Transformers have risen to become the state-of-the-art vision architectures through innovations in attention mechanisms inspired by visual perception. At present, two classes of attention prevail in vision transformers: regional and sparse attention. The former bounds pixel interactions within a region; the latter spreads them across sparse grids. Their opposing natures have resulted in a dilemma between preserving hierarchical relations and attaining a global context. In this work, taking inspiration from atrous convolution, we introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information while maintaining hierarchical relations. As a further tribute to atrous convolution, we redesign the ubiquitous inverted residual convolution blocks with atrous convolution. Finally, we propose a generalized,…
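The contrast the abstract draws between regional and sparse attention comes down to how pixels are grouped before attention is computed. The following is a minimal sketch of the two groupings on a toy 4x4 grid, not the authors' implementation: the function names `window_partition` and `atrous_partition` are hypothetical, and both the attention computation and the paper's fusion of the two schemes are left out.

```python
def window_partition(grid, window):
    """Regional grouping: contiguous window x window blocks of the grid."""
    h, w = len(grid), len(grid[0])
    return [
        [row[c:c + window] for row in grid[r:r + window]]
        for r in range(0, h, window)
        for c in range(0, w, window)
    ]


def atrous_partition(grid, rate):
    """Sparse grouping: each group takes every `rate`-th pixel, so its
    members are spread over the whole grid (dilated sampling, in the
    spirit of atrous convolution)."""
    return [
        [row[j::rate] for row in grid[i::rate]]
        for i in range(rate)
        for j in range(rate)
    ]


# Toy 4x4 "feature map" whose entries are their own flat indices.
grid = [[4 * r + c for c in range(4)] for r in range(4)]

regional = window_partition(grid, 2)  # hypothetical window size of 2
sparse = atrous_partition(grid, 2)    # hypothetical dilation rate of 2

print(regional[0])  # neighbouring pixels from the top-left corner
print(sparse[0])    # pixels two steps apart, spanning the full grid
```

In the regional grouping, each group sees only a local neighbourhood; in the sparse grouping, each group spans the full spatial extent but skips the pixels in between. This is the trade-off between hierarchical relations and global context that the abstract describes.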

Taxonomy

Topics: Infrared Target Detection Methodologies

Methods: Attention Is All You Need · Softmax · Dense Connections · Residual Connection · Linear Layer · Layer Normalization · Multi-Head Attention · Convolution · Vision Transformer