Inverse Visual Question Answering with Multi-Level Attentions
Yaser Alwattar, Yuhong Guo

TL;DR
This paper introduces a deep multi-level attention model for inverse visual question answering, the task of generating a question from an image and an answer cue. It uses object-level features and two levels of attention (dual and dynamic) to improve question generation, and reports state-of-the-art results on the VQA V1 dataset.
Contribution
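The paper presents a novel multi-level attention framework for inverse VQA that enhances object-level visual and semantic features with the answer cue, using dual attention when encoding the partial question and dynamic attention when generating the next question word.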
Findings
Achieves state-of-the-art performance on the VQA V1 dataset across multiple commonly used metrics
Demonstrates effectiveness of multi-level attention in inverse VQA
Improves question generation by using the answer cue to guide attention over object-level features
Abstract
In this paper, we propose a novel deep multi-level attention model to address inverse visual question answering. The proposed model generates regional visual and semantic features at the object level and then enhances them with the answer cue by using attention mechanisms. Two levels of multiple attentions are employed in the model, including the dual attention at the partial question encoding step and the dynamic attention at the next question word generation step. We evaluate the proposed model on the VQA V1 dataset. It demonstrates state-of-the-art performance in terms of multiple commonly used metrics.
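To make the idea of enhancing regional features with the answer cue more concrete, here is a minimal sketch of answer-guided soft attention over a handful of region vectors. It is an illustration only, not the authors' model: the abstract does not specify the scoring function, so scaled dot-product scoring is an assumption, and the function name `answer_guided_attention` and the toy vectors are hypothetical.

```python
import math


def answer_guided_attention(regions, answer):
    """Attend over region feature vectors using an answer vector as the cue.

    Assumption: scaled dot-product scoring; the paper may use a different form.
    Returns the attention weights and the weighted sum of the region features.
    """
    d = len(answer)
    # Similarity of each region to the answer cue, scaled by sqrt(d).
    scores = [
        sum(r_k * a_k for r_k, a_k in zip(region, answer)) / math.sqrt(d)
        for region in regions
    ]
    # Numerically stable softmax over the region scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attended feature: convex combination of the region vectors.
    context = [
        sum(w * region[k] for w, region in zip(weights, regions))
        for k in range(d)
    ]
    return weights, context


# Toy example: three 2-d "region" features and one "answer" embedding.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
answer = [2.0, 0.0]
weights, context = answer_guided_attention(regions, answer)
```

In this toy setup the first region is most aligned with the answer vector, so it receives the largest weight. A model along the lines described in the abstract would presumably apply this kind of weighting at both the question encoding step and the word generation step, with learned projections instead of raw dot products.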
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
