Revisiting the Integration of Convolution and Attention for Vision Backbone
Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau

TL;DR
This paper proposes a novel vision backbone architecture called GLMix that integrates convolutions and multi-head self-attention at different granularities, improving efficiency and interpretability.
Contribution
It introduces a parallel multi-granularity integration scheme with soft clustering, enabling local-global feature fusion and matching state-of-the-art performance with fewer attention slots.
Findings
GLMix matches state-of-the-art performance efficiently while using fewer attention slots.
Soft clustering produces meaningful semantic groupings.
The approach enhances interpretability and is suitable for weakly-supervised tasks.
Abstract
Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel at different granularity levels instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the…
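The two-granularity idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`soft_cluster`, `self_attention`, `local_branch`, `glmix_like_block`), the dot-product assignment rule, the projection-free single-head attention, the 3x3 mean filter standing in for a learned convolution, and the additive fusion are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_cluster(feats, centers):
    """Softly assign N grid features (N, C) to M slot centers (M, C)."""
    logits = feats @ centers.T / np.sqrt(feats.shape[1])    # (N, M)
    assign = softmax(logits, axis=1)        # each pixel spreads over slots
    weights = assign / assign.sum(axis=0, keepdims=True)    # columns sum to 1
    slots = weights.T @ feats               # (M, C) weighted means of pixels
    return assign, slots

def self_attention(slots):
    """Single-head attention with no learned projections (a simplification)."""
    scores = slots @ slots.T / np.sqrt(slots.shape[1])      # (M, M)
    return softmax(scores, axis=1) @ slots

def local_branch(grid):
    """3x3 mean filter on an (H, W, C) grid, a stand-in for a learned Conv."""
    h, w, _ = grid.shape
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(grid)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def glmix_like_block(grid, centers):
    """Run the local (grid) and global (slot) branches in parallel and fuse."""
    h, w, c = grid.shape
    feats = grid.reshape(h * w, c)
    assign, slots = soft_cluster(feats, centers)   # grid -> slots
    slots = self_attention(slots)                  # attention among M slots only
    global_feats = (assign @ slots).reshape(h, w, c)  # slots -> grid
    return local_branch(grid) + global_feats, assign

rng = np.random.default_rng(0)
grid = rng.normal(size=(8, 8, 16))     # hypothetical 8x8 feature map, 16 channels
centers = rng.normal(size=(4, 16))     # hypothetical 4 slot centers
out, assign = glmix_like_block(grid, centers)
print(out.shape, assign.shape)
```

In this sketch the attention matrix has size M x M for M slots rather than N x N for N pixels, which is the source of the efficiency argument when M is much smaller than N. The `assign` matrix can be reshaped to (H, W, M) to inspect which pixels each slot groups together.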
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Infrared Target Detection Methodologies
Methods: Sparse Evolutionary Training
