Multiscale Self Attentive Convolutions for Vision and Language Modeling
Oren Barkan

TL;DR
This paper introduces multiscale self attentive convolutions (MSAC), a novel approach that generalizes self attention for vision and language tasks by operating on m-grams and image patches across multiple scales.
Contribution
The paper shows that self attention can be formulated in terms of 1x1 convolutions, generalizes it to 1xm and nxm filters (the SAC operator), and combines several SAC operators of different filter sizes in parallel (MSAC).
Findings
MSAC effectively captures multiscale features in images and text.
The approach improves performance on vision and language modeling tasks.
MSAC enables cross-modal image similarity assessment.
Abstract
Self attention mechanisms have become a key building block in many state-of-the-art language understanding models. In this paper, we show that the self attention operator can be formulated in terms of 1x1 convolution operations. Following this observation, we propose several novel operators: First, we introduce a 2D version of self attention that is applicable for 2D signals such as images. Second, we present the 1D and 2D Self Attentive Convolutions (SAC) operator that generalizes self attention beyond 1x1 convolutions to 1xm and nxm convolutions, respectively. While 1D and 2D self attention operate on individual words and pixels, SAC operates on m-grams and image patches, respectively. Third, we present a multiscale version of SAC (MSAC) which analyzes the input by employing multiple SAC operators that vary by filter size, in parallel. Finally, we explain how MSAC can be utilized for…
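The abstract's central observation, that self attention can be written with 1x1 convolutions and then widened to 1xm filters, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the zero "same" padding, the scaled dot-product softmax, and the concatenation used to merge scales are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_same(x, w):
    """1D cross-correlation with zero 'same' padding (assumed padding scheme).

    x: (T, d_in) sequence, w: (m, d_in, d_out) filter bank -> (T, d_out).
    With m == 1 this reduces to the per-position linear map x @ w[0].
    """
    m = w.shape[0]
    left = (m - 1) // 2
    right = m - 1 - left
    xp = np.pad(x, ((left, right), (0, 0)))
    # Gather the m-gram window around every position: (T, m, d_in).
    windows = np.stack([xp[i:i + x.shape[0]] for i in range(m)], axis=1)
    return np.einsum("tmi,mio->to", windows, w)

def self_attentive_conv_1d(x, wq, wk, wv):
    """Hypothetical 1D SAC: queries, keys and values come from 1xm convolutions."""
    q = conv1d_same(x, wq)
    k = conv1d_same(x, wk)
    v = conv1d_same(x, wv)
    # Scaled dot-product attention (the scaling is an assumption here).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def multiscale_sac_1d(x, params):
    """Hypothetical MSAC: run SAC at several filter sizes in parallel.

    params: list of (wq, wk, wv) tuples, one per filter size.
    Concatenating the outputs is an assumed way of merging the scales.
    """
    outs = [self_attentive_conv_1d(x, wq, wk, wv) for wq, wk, wv in params]
    return np.concatenate(outs, axis=-1)
```

With filter size m = 1, `self_attentive_conv_1d` coincides with ordinary self attention over individual tokens; larger m makes each query, key, and value depend on an m-gram, which is the distinction the abstract draws between self attention and SAC.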
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: 1x1 Convolution · Convolution
