ACC-ViT : Atrous Convolution's Comeback in Vision Transformers
Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara

TL;DR
This paper introduces ACC-ViT, a hybrid vision transformer that combines regional and sparse attention through Atrous Attention, inspired by atrous convolution, achieving high accuracy with fewer parameters and versatility across tasks.
Contribution
The work proposes Atrous Attention to fuse local and global information in vision transformers and redesigns convolution blocks with atrous convolution, creating a versatile hybrid backbone.
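For readers unfamiliar with the operation that inspires the method, the following is a minimal pure-Python sketch of a 1D atrous (dilated) convolution. The function names `atrous_conv1d` and `receptive_field` and the parameter `rate` are illustrative assumptions and are not taken from the paper; the sketch only shows how spacing the kernel taps apart widens the receptive field without adding weights.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """Valid-mode 1D atrous (dilated) convolution, written as cross-correlation.

    `rate` is the dilation rate (hypothetical parameter name): kernel taps are
    applied `rate` samples apart, so rate=1 is an ordinary convolution.
    """
    span = (len(kernel) - 1) * rate + 1  # input samples covered by one output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(k * signal[start + i * rate] for i, k in enumerate(kernel)))
    return out


def receptive_field(kernel_size, rate):
    """Number of input positions spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * rate + 1


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
dense = atrous_conv1d(signal, kernel, rate=1)   # taps at offsets 0, 1, 2
atrous = atrous_conv1d(signal, kernel, rate=2)  # taps at offsets 0, 2, 4
```

With the same three weights, the dilated kernel covers five input positions instead of three, which is the property the summary alludes to when it speaks of fusing local and global information.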
Findings
Achieves ~84% accuracy on ImageNet-1K with fewer parameters.
Outperforms the state-of-the-art MaxViT by 0.42% in accuracy.
Effective across various tasks like medical imaging, detection, and contrastive learning.
Abstract
Transformers have risen to become the state-of-the-art vision architectures through innovations in attention mechanisms inspired by visual perception. At present, two classes of attention prevail in vision transformers: regional and sparse attention. The former bounds the pixel interactions within a region; the latter spreads them across sparse grids. Their opposing natures have resulted in a dilemma between preserving hierarchical relations and attaining a global context. In this work, taking inspiration from atrous convolution, we introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information while maintaining hierarchical relations. As a further tribute to atrous convolution, we redesign the ubiquitous inverted residual convolution blocks with atrous convolution. Finally, we propose a generalized,…
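The difference between regional and sparse attention described in the abstract can be sketched by looking at which pixels are grouped together before attention is applied. The pure-Python example below is an illustration under stated assumptions, not the paper's implementation: the helper names `regional_windows` and `sparse_grids` and their parameters are hypothetical, and the sketch does not show how Atrous Attention fuses the two patterns.

```python
def regional_windows(height, width, window):
    """Group pixel coordinates into contiguous window x window blocks.

    Attention restricted to one group only sees a local region.
    """
    groups = {}
    for r in range(height):
        for c in range(width):
            groups.setdefault((r // window, c // window), []).append((r, c))
    return list(groups.values())


def sparse_grids(height, width, rate):
    """Group pixel coordinates by taking every `rate`-th row and column.

    Attention restricted to one group sees pixels spread over the whole map,
    analogous to the taps of an atrous kernel with dilation `rate`.
    """
    groups = {}
    for r in range(height):
        for c in range(width):
            groups.setdefault((r % rate, c % rate), []).append((r, c))
    return list(groups.values())


regional = regional_windows(4, 4, window=2)
sparse = sparse_grids(4, 4, rate=2)
first_regional = regional[0]  # neighbouring pixels in the top-left corner
first_sparse = sparse[0]      # pixels two steps apart across the full map
```

On a 4×4 map both schemes yield four groups of four pixels, so the attention cost per group is the same; what changes is whether a group is spatially compact (regional) or spread across the map (sparse).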
Taxonomy
Topics: Infrared Target Detection Methodologies
Methods: Attention Is All You Need · Softmax · Dense Connections · Residual Connection · Linear Layer · Layer Normalization · Multi-Head Attention · Convolution · Vision Transformer
