Multimodal Continuous Visual Attention Mechanisms

António Farinhas; André F. T. Martins; Pedro M. Q. Aguiar

arXiv:2104.03046·cs.CV·April 8, 2021

TL;DR

This paper introduces a continuous attention mechanism that produces multimodal densities (mixtures of Gaussians), so it can model complex, non-contiguous image regions. The resulting attention maps are more interpretable, and accuracy on VQA-v2 is competitive.

Contribution

The paper proposes a novel continuous attention model with multimodal densities, enabling better modeling of complex image regions and improved interpretability over unimodal approaches.

Findings

01

Achieves competitive accuracy on the VQA-v2 dataset.

02

Produces attention maps that closely resemble human attention.

03

Automatically segregates objects in complex scenes.

Abstract

Visual attention mechanisms are a key component of neural network models for computer vision. By focusing on a discrete set of objects or image regions, these mechanisms identify the most relevant features and use them to build more powerful representations. Recently, continuous-domain alternatives to discrete attention models have been proposed, which exploit the continuity of images. These approaches model attention as simple unimodal densities (e.g. a Gaussian), making them less suitable to deal with images whose region of interest has a complex shape or is composed of multiple non-contiguous patches. In this paper, we introduce a new continuous attention mechanism that produces multimodal densities, in the form of mixtures of Gaussians. We use the EM algorithm to obtain a clustering of relevant regions in the image, and a description length penalty to select the number of components…
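The abstract's combination of EM clustering and a description length penalty can be illustrated with a minimal sketch. This is not the authors' implementation: it fits a 2-D Gaussian mixture to a grid of attention weights with a weighted EM loop, then picks the number of components with a BIC-style penalty used here as a stand-in for the paper's criterion. The function names, the deterministic initialisation, the covariance floor `eps`, and the effective sample size `n_eff` are all assumptions made for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a 2-D Gaussian at each row of x (shape (n, 2))."""
    diff = x - mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * maha) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

def init_means(x, w, k):
    """Assumed deterministic init: heaviest location first, then locations
    that are both heavily weighted and far from the means chosen so far."""
    means = [x[np.argmax(w)]]
    while len(means) < k:
        d2 = np.min([np.sum((x - m) ** 2, axis=1) for m in means], axis=0)
        means.append(x[np.argmax(w * d2)])
    return np.array(means, dtype=float)

def weighted_em(x, w, k, n_iter=100, eps=1e-4):
    """Fit a k-component mixture to locations x with weights w (sum to 1)."""
    means = init_means(x, w, k)
    covs = np.array([0.05 * np.eye(2) for _ in range(k)])
    pis = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each location.
        comp = np.stack([pis[j] * gaussian_pdf(x, means[j], covs[j])
                         for j in range(k)], axis=1)
        resp = comp / (comp.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: every statistic is weighted by the attention mass w.
        wr = w[:, None] * resp
        pis = np.maximum(wr.sum(axis=0), 1e-12)
        means = (wr.T @ x) / pis[:, None]
        for j in range(k):
            diff = x - means[j]
            # eps * I is an assumed floor that keeps covariances invertible.
            covs[j] = (wr[:, j, None] * diff).T @ diff / pis[j] + eps * np.eye(2)
    mix = sum(pis[j] * gaussian_pdf(x, means[j], covs[j]) for j in range(k))
    loglik = float(np.sum(w * np.log(mix + 1e-300)))  # weighted log-likelihood
    return pis, means, covs, loglik

def description_length(loglik, k, n_eff):
    """BIC-style cost (assumed form): data cost plus parameter cost.
    Each 2-D component has 2 mean + 3 covariance parameters; k - 1 weights."""
    n_params = 6 * k - 1
    return -n_eff * loglik + 0.5 * n_params * np.log(n_eff)

def select_k(x, w, max_k=4, n_eff=None):
    """Return the number of components with the lowest description length."""
    n_eff = len(x) if n_eff is None else n_eff
    costs = [description_length(weighted_em(x, w, k)[3], k, n_eff)
             for k in range(1, max_k + 1)]
    return int(np.argmin(costs)) + 1

# Toy attention map: two non-contiguous blobs on a 20 x 20 grid over [0, 1]^2.
g = np.linspace(0.0, 1.0, 20)
x = np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1).reshape(-1, 2)
w = sum(np.exp(-np.sum((x - c) ** 2, axis=1) / (2 * 0.07 ** 2))
        for c in ([0.25, 0.25], [0.75, 0.75]))
w = w / w.sum()

best_k = select_k(x, w, max_k=3)
pis, means, covs, loglik = weighted_em(x, w, 2)
print("selected components:", best_k)
print("means for k=2:\n", means)
```

On this toy map a single Gaussian has to stretch across both blobs, which is the limitation of unimodal densities that the abstract describes. The penalty term is what stops the fit from adding components indefinitely.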
