Modulated Self-attention Convolutional Network for VQA
Jean-Benoit Delbrouck, Antoine Maiorca, Nathan Hubens, and Stéphane Dupont

TL;DR
This paper introduces a modulated self-attention convolutional network that integrates linguistic input to enhance visual feature extraction for VQA, showing promising initial improvements.
Contribution
It proposes a novel CNN architecture augmented with self-attention modulated by language input for improved visual reasoning in VQA.
Findings
Encouraging relative improvements in VQA performance.
Demonstrates the potential of language-modulated self-attention in CNNs.
Provides a foundation for future research in end-to-end visual feature extraction.
Abstract
As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to treat visual feature extraction as an end-to-end process during training. This small contribution suggests new ideas for improving the visual processing of traditional convolutional networks for visual question answering (VQA). In this paper, we propose to modulate a CNN augmented with self-attention using a linguistic input. We show encouraging relative improvements that motivate future research in this direction.
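The abstract does not spell out how the linguistic input modulates the self-attention layer, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes a FiLM-style scale and shift, predicted from a question embedding, applied to the spatial features before the attention queries and keys are computed. All names (`init_params`, `modulated_self_attention`, `W_gamma`, `W_beta`, and so on) and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, channels, q_dim, key_dim):
    """Random weights for the sketch (hypothetical layout, small scale)."""
    def w(*shape):
        return 0.1 * rng.normal(size=shape)
    return {
        "W_gamma": w(channels, q_dim),  # question -> per-channel scale
        "W_beta": w(channels, q_dim),   # question -> per-channel shift
        "W_q": w(channels, key_dim),    # feature -> attention query
        "W_k": w(channels, key_dim),    # feature -> attention key
        "W_v": w(channels, channels),   # feature -> attention value
    }


def modulated_self_attention(feat, q_emb, params):
    """Self-attention over spatial positions, modulated by a question.

    feat:  (C, H, W) convolutional feature map.
    q_emb: (D,) question embedding (e.g. from a recurrent encoder).
    Returns the updated (C, H, W) map and the (H*W, H*W) attention matrix.
    """
    C, H, W = feat.shape
    x = feat.reshape(C, H * W).T  # (N, C): one row per spatial position

    # Assumption: the language input acts as a FiLM-style scale and shift.
    gamma = params["W_gamma"] @ q_emb  # (C,)
    beta = params["W_beta"] @ q_emb    # (C,)
    x_mod = x * (1.0 + gamma) + beta   # broadcasts over positions

    # Queries and keys see the modulated features; values stay visual only.
    q = x_mod @ params["W_q"]  # (N, K)
    k = x_mod @ params["W_k"]  # (N, K)
    v = x @ params["W_v"]      # (N, C)

    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # rows sum to 1
    out = x + attn @ v  # residual connection around the attention block
    return out.T.reshape(C, H, W), attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, channels=8, q_dim=16, key_dim=4)
    feat = rng.normal(size=(8, 5, 5))
    question = rng.normal(size=16)
    out, attn = modulated_self_attention(feat, question, params)
    print(out.shape, attn.shape)
```

In this sketch the question changes which spatial positions attend to each other, while the residual connection keeps the original CNN features intact, so the block could in principle be inserted into an existing network and trained end to end.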
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
