Revisiting the Integration of Convolution and Attention for Vision Backbone

Lei Zhu; Xinjiang Wang; Wayne Zhang; Rynson W. H. Lau

arXiv:2411.14429·cs.CV·November 22, 2024

TL;DR

This paper proposes a novel vision backbone architecture called GLMix that integrates convolutions and multi-head self-attention at different granularities, improving efficiency and interpretability.

Contribution

It introduces a parallel multi-granularity integration scheme with soft clustering, enabling local-global feature fusion and matching state-of-the-art performance with fewer attention slots.

Findings

01. Matches state-of-the-art performance while using fewer attention slots.

02. Soft clustering produces meaningful semantic groupings.

03. The approach enhances interpretability and is suitable for weakly-supervised tasks.

Abstract

Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel at different granularity levels instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the…
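The two-representation idea can be illustrated with a small NumPy sketch. It is not the authors' implementation (the official code is in the PyTorch repository listed under Code & Models); it only assumes, following the summary above, that a local operator runs on the pixel grid while attention runs on a small set of slots obtained by soft clustering. The function names, the random slot centers, the 3x3 box filter standing in for a learned convolution, the single-head attention without projections, and the additive fusion are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_cluster(pixels, centers):
    """Softly assign N pixels (N, C) to M slots using hypothetical centers (M, C)."""
    logits = pixels @ centers.T / np.sqrt(pixels.shape[1])   # (N, M)
    assign = softmax(logits, axis=1)          # each pixel spreads its mass over slots
    weights = assign / assign.sum(axis=0, keepdims=True)     # normalise per slot
    slots = weights.T @ pixels                # (M, C) weighted means of pixel features
    return assign, slots

def slot_attention_step(slots):
    """Single-head self-attention over slots, without learned projections (assumption)."""
    scores = slots @ slots.T / np.sqrt(slots.shape[1])       # (M, M), not (N, N)
    return softmax(scores, axis=1) @ slots

def dispatch(assign, slots):
    """Send slot features back to the grid using the same soft assignment."""
    return assign @ slots                     # (N, C)

def local_mix(grid):
    """3x3 box filter as a stand-in for a learned convolution (assumption)."""
    h, w, _ = grid.shape
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(grid)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def two_granularity_block(grid, centers):
    """Local branch on the grid plus global branch on slots, fused by addition."""
    h, w, c = grid.shape
    pixels = grid.reshape(h * w, c)
    assign, slots = soft_cluster(pixels, centers)
    global_feat = dispatch(assign, slot_attention_step(slots)).reshape(h, w, c)
    return local_mix(grid) + global_feat, assign

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))        # toy feature map: 8x8 grid, 16 channels
centers = rng.standard_normal((4, 16))     # 4 hypothetical slot centers
out, assign = two_granularity_block(x, centers)
```

In this toy setting the attention matrix is 4x4 (slots) rather than 64x64 (pixels), which is the kind of saving the abstract points to when it questions running MHSAs at pixel granularity. The `assign` array can also be reshaped to (8, 8, 4) to inspect which grid positions each slot draws from.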

Code & Models

Repositories

rayleizhu/glmix
PyTorch · Official

Videos

Revisiting the Integration of Convolution and Attention for Vision Backbone · SlidesLive

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Infrared Target Detection Methodologies

Methods: Sparse Evolutionary Training