TL;DR
This paper introduces an enhanced attention mechanism with an Attention on Attention module within an encoder-decoder framework for Visual Question Answering, significantly improving accuracy on the VQA-v2 benchmark.
Contribution
It proposes a novel Attention on Attention (AoA) module and a multimodal fusion approach, advancing the state-of-the-art in VQA performance.
Findings
Achieves state-of-the-art results on the VQA-v2 dataset
Demonstrates the effectiveness of AoA in capturing complex dependencies
Improves multimodal information integration
Abstract
We consider the problem of Visual Question Answering (VQA). Given an image and a free-form, open-ended question expressed in natural language, the goal of a VQA system is to provide an accurate answer to the question with respect to the image. The task is challenging because it requires simultaneous and intricate understanding of both visual and textual information. Attention, which captures intra- and inter-modal dependencies, has emerged as perhaps the most widely used mechanism for addressing these challenges. In this paper, we propose an improved attention-based architecture to solve VQA. We incorporate an Attention on Attention (AoA) module within an encoder-decoder framework, which is able to determine the relation between attention results and queries. An attention module generates a weighted average for each query. The AoA module, on the other hand, first generates an information vector and an…
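The abstract is cut off before it finishes describing the AoA module, so the following is only a minimal sketch under a stated assumption: that AoA computes an information vector and a sigmoid gate from the query together with the attention result, then multiplies the two elementwise. This is the form commonly associated with AoA elsewhere, and it may differ from what this paper does. The function names, the weight matrices `W_info` and `W_gate`, the scaled dot-product scoring, and the omission of bias terms are all illustrative choices, not details taken from the paper.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Plain attention for one query: a weighted average of the values.

    Scaled dot-product scoring is an assumption made for this sketch.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]


def attention_on_attention(query, keys, values, W_info, W_gate):
    """Assumed AoA form: sigmoid gate * information vector.

    Both are linear functions of the concatenation [query; attended],
    so the output depends on how the attention result relates to the query.
    Bias terms are omitted for brevity (hypothetical simplification).
    """
    attended = attend(query, keys, values)
    joint = query + attended  # list concatenation: [query; attended]
    info = matvec(W_info, joint)
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(W_gate, joint)]
    return [g * i for g, i in zip(gate, info)]


if __name__ == "__main__":
    random.seed(0)
    d = 4  # toy feature size
    rand_vec = lambda n: [random.uniform(-1, 1) for _ in range(n)]
    query = rand_vec(d)                      # e.g. a question-word feature
    keys = [rand_vec(d) for _ in range(3)]   # e.g. image-region features
    values = [rand_vec(d) for _ in range(3)]
    W_info = [rand_vec(2 * d) for _ in range(d)]
    W_gate = [rand_vec(2 * d) for _ in range(d)]
    print(attention_on_attention(query, keys, values, W_info, W_gate))
```

In this sketch the gate lies strictly between 0 and 1 per dimension, so it can suppress components of the information vector when the attention result is a poor match for the query, which is one way to read "determine the relation between attention results and queries".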
