Multimodal Continuous Visual Attention Mechanisms
António Farinhas, André F. T. Martins, Pedro M. Q. Aguiar

TL;DR
This paper introduces a multimodal continuous attention mechanism using Gaussian mixtures, improving interpretability and performance in visual tasks by modeling complex, non-contiguous image regions.
Contribution
The paper proposes a novel continuous attention model with multimodal densities, enabling better modeling of complex image regions and improved interpretability over unimodal approaches.
Findings
Achieves competitive accuracy on the VQA-v2 dataset.
Produces attention maps that closely resemble human attention.
Automatically segregates objects in complex scenes.
Abstract
Visual attention mechanisms are a key component of neural network models for computer vision. By focusing on a discrete set of objects or image regions, these mechanisms identify the most relevant features and use them to build more powerful representations. Recently, continuous-domain alternatives to discrete attention models have been proposed, which exploit the continuity of images. These approaches model attention as simple unimodal densities (e.g. a Gaussian), making them less suitable to deal with images whose region of interest has a complex shape or is composed of multiple non-contiguous patches. In this paper, we introduce a new continuous attention mechanism that produces multimodal densities, in the form of mixtures of Gaussians. We use the EM algorithm to obtain a clustering of relevant regions in the image, and a description length penalty to select the number of components…
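The idea in the abstract (EM clustering of relevant image regions, plus a penalty that picks the number of mixture components) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names (`log_gauss`, `weighted_em`, `select_k`), the synthetic attention map, the covariance regulariser `reg`, and in particular the BIC-style score, which stands in for the paper's description length penalty because its exact form is not given in the text above.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log-density of a multivariate Gaussian at each row of x."""
    diff = x - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + x.shape[1] * np.log(2 * np.pi))

def weighted_em(points, weights, k, n_iter=100, reg=1e-4, seed=0):
    """Fit a k-component Gaussian mixture to weighted points with EM.

    `weights` plays the role of discrete attention scores over grid
    locations (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n, d = points.shape
    w = weights / weights.sum()
    # Initialise means at points drawn in proportion to their weight.
    means = points[rng.choice(n, size=k, replace=False, p=w)]
    centred = points - w @ points
    global_cov = (w[:, None] * centred).T @ centred + reg * np.eye(d)
    covs = np.tile(global_cov, (k, 1, 1))
    pis = np.full(k, 1.0 / k)

    def log_mix():
        log_p = np.stack(
            [np.log(pis[j]) + log_gauss(points, means[j], covs[j]) for j in range(k)],
            axis=1,
        )
        top = log_p.max(axis=1, keepdims=True)  # stable log-sum-exp
        return log_p, top[:, 0] + np.log(np.exp(log_p - top).sum(axis=1))

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        log_p, log_norm = log_mix()
        resp = np.exp(log_p - log_norm[:, None])
        # M-step: weighted updates of mixing weights, means, covariances.
        rw = resp * w[:, None]
        nk = rw.sum(axis=0) + 1e-12
        pis = nk / nk.sum()
        means = (rw.T @ points) / nk[:, None]
        for j in range(k):
            diff = points - means[j]
            covs[j] = (rw[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)

    _, log_norm = log_mix()
    return pis, means, covs, float(w @ log_norm)  # weight-averaged log-likelihood

def select_k(points, weights, k_max=3):
    """Pick k by a BIC-style score (a stand-in for a description length penalty)."""
    n, d = points.shape
    fits, scores = {}, {}
    for k in range(1, k_max + 1):
        fits[k] = weighted_em(points, weights, k)
        n_params = (k - 1) + k * d + k * d * (d + 1) // 2
        scores[k] = -n * fits[k][3] + 0.5 * n_params * np.log(n)
    return min(scores, key=scores.get), fits, scores

# Synthetic attention map with two non-contiguous regions on a 20x20 grid.
xs = np.linspace(0.0, 1.0, 20)
grid = np.stack(np.meshgrid(xs, xs, indexing="ij"), axis=-1).reshape(-1, 2)

def bump(centre, scale=0.08):
    return np.exp(-0.5 * np.sum((grid - centre) ** 2, axis=1) / scale**2)

attn = bump(np.array([0.25, 0.25])) + bump(np.array([0.75, 0.75]))
best_k, fits, scores = select_k(grid, attn)
```

On a map like this, a single Gaussian has to stretch across both regions, whereas a two-component mixture can place one mode on each. That is the kind of non-contiguous region of interest the abstract says unimodal densities handle poorly.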
