Dual Recurrent Attention Units for Visual Question Answering
Ahmed Osman, Wojciech Samek

TL;DR
This paper introduces a recurrent attention mechanism for visual question answering, shows its benefits over traditional convolutional attention, and reports strong results on the VQA 1.0 and VQA 2.0 datasets.
Contribution
The paper proposes dual (textual and visual) Recurrent Attention Units (RAUs) for VQA, compares them against convolutional attention, and reports improved performance over existing models.
Findings
Outperforms the first-place entry of the VQA 2016 challenge
Ranks second best on the VQA 1.0 dataset
Improves the performance of existing state-of-the-art models
Abstract
Visual Question Answering (VQA) requires AI models to comprehend data in two domains, vision and text. Current state-of-the-art models use learned attention mechanisms to extract relevant information from the input domains to answer a certain question. Thus, robust attention mechanisms are essential for powerful VQA models. In this paper, we propose a recurrent attention mechanism and show its benefits compared to the traditional convolutional approach. We perform two ablation studies to evaluate recurrent attention. First, we introduce a baseline VQA model with visual attention and test the performance difference between convolutional and recurrent attention on the VQA 2.0 dataset. Secondly, we design an architecture for VQA which utilizes dual (textual and visual) Recurrent Attention Units (RAUs). Using this model, we show the effect of all possible combinations of recurrent and…
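To make the contrast between convolutional and recurrent attention concrete, here is a minimal sketch of recurrent attention over a set of image-region features. It is an illustration under stated assumptions, not the authors' RAU: a plain tanh recurrent cell stands in for whatever recurrent unit the paper uses, the weights are random and untrained, and the names `recurrent_attention`, `rand_matrix`, and `hidden_size` are hypothetical. The point it shows is that each region's attention score depends on a hidden state carried across regions, whereas a 1x1 convolutional attention would score every region independently.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper: small random weights (untrained)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


def recurrent_attention(features, question, hidden_size=8, seed=0):
    """Attend over `features` (N region vectors) given a `question` vector.

    Assumption: a simple tanh RNN reads the regions in order; each hidden
    state is projected to one scalar score, and a softmax over the N scores
    gives the attention weights.
    """
    rng = random.Random(seed)
    in_dim = len(features[0]) + len(question)
    W_in = rand_matrix(hidden_size, in_dim, rng)
    W_h = rand_matrix(hidden_size, hidden_size, rng)
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]

    h = [0.0] * hidden_size
    scores = []
    for f in features:
        x = f + question  # fuse region and question by concatenation
        a = matvec(W_in, x)
        b = matvec(W_h, h)  # contribution of previously seen regions
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        scores.append(sum(w * hi for w, hi in zip(w_out, h)))

    weights = softmax(scores)
    dim = len(features[0])
    # Attended feature: attention-weighted sum of the region vectors.
    attended = [sum(w * f[d] for w, f in zip(weights, features))
                for d in range(dim)]
    return weights, attended


# Toy inputs: four 3-dimensional "regions" and a 2-dimensional question.
regions = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1], [0.4, 0.4, 0.4], [0.0, 1.0, 0.0]]
question = [0.3, -0.2]
weights, attended = recurrent_attention(regions, question)
```

The same function could in principle be applied to a sequence of word embeddings in place of `regions`, which is one way to read the "dual (textual and visual)" setup described above. How the paper actually fuses and orders its inputs is not specified in this summary.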
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
