Modulated Self-attention Convolutional Network for VQA

Jean-Benoit Delbrouck, Antoine Maiorca, Nathan Hubens, and Stéphane Dupont

arXiv:1910.03343 · cs.CV · November 1, 2019

Open Access

TL;DR

This paper introduces a modulated self-attention convolutional network that integrates linguistic input to enhance visual feature extraction for VQA, showing promising initial improvements.

Contribution

It proposes a novel CNN architecture augmented with self-attention modulated by language input for improved visual reasoning in VQA.

Findings

01. Encouraging relative improvements in VQA performance.

02. Demonstrates the potential of language-modulated self-attention in CNNs.

03. Provides a foundation for future research in end-to-end visual feature extraction.

Abstract

As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to treat visual feature extraction as an end-to-end process during training. This small contribution suggests new ideas for improving the visual processing of traditional convolutional networks for visual question answering (VQA). In this paper, we propose to use a linguistic input to modulate a CNN augmented with self-attention. We show encouraging relative improvements that motivate future research in this direction.
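The abstract does not spell out how the linguistic input modulates the self-attention layer, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes a FiLM-style conditioning in which a question embedding predicts a per-channel scale and shift that is applied to the feature map before standard scaled dot-product self-attention over spatial positions. All names (`modulated_self_attention`, `init_params`, `W_gamma`, `W_beta`, and the dimensions) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(C, D, K, seed=0):
    """Random weights for the sketch (hypothetical shapes, not from the paper)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "W_gamma": scale * rng.standard_normal((D, C)),  # question -> channel scale
        "W_beta": scale * rng.standard_normal((D, C)),   # question -> channel shift
        "W_q": scale * rng.standard_normal((C, K)),      # query projection
        "W_k": scale * rng.standard_normal((C, K)),      # key projection
        "W_v": scale * rng.standard_normal((C, C)),      # value projection
    }


def modulated_self_attention(feat, q_vec, params):
    """Self-attention over spatial positions, conditioned on a question vector.

    feat:  (C, H, W) feature map from a convolutional block
    q_vec: (D,) question embedding (e.g. the final state of a text encoder)
    Returns the updated (C, H, W) feature map and the (H*W, H*W) attention matrix.
    """
    C, H, W = feat.shape
    x = feat.reshape(C, H * W).T  # (N, C): one row per spatial position

    # ASSUMPTION: the language input acts as a per-channel scale and shift.
    gamma = q_vec @ params["W_gamma"]  # (C,)
    beta = q_vec @ params["W_beta"]    # (C,)
    x_mod = x * (1.0 + gamma) + beta   # broadcast over the N positions

    # Standard scaled dot-product self-attention on the modulated features.
    q = x_mod @ params["W_q"]  # (N, K)
    k = x_mod @ params["W_k"]  # (N, K)
    v = x_mod @ params["W_v"]  # (N, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N, N)

    out = x + attn @ v  # residual connection keeps the original CNN signal
    return out.T.reshape(C, H, W), attn
```

With this formulation, an all-zero question vector gives `gamma = 0` and `beta = 0`, so the layer reduces to plain self-attention; any other question changes which spatial positions attend to each other. Because every step is differentiable, such a layer could in principle be trained end to end with the rest of the network, which is the setting the abstract argues for.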

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques