Inverse Visual Question Answering with Multi-Level Attentions

Yaser Alwattar; Yuhong Guo

arXiv:1909.07583·cs.CV·December 4, 2020

Open Access

TL;DR

This paper introduces a deep multi-level attention model for inverse visual question answering, the task of generating a question from an image and an answer. The model builds object-level visual and semantic features and applies dual and dynamic attention to guide question generation, and it reports state-of-the-art results on the VQA V1 dataset.

Contribution

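The paper presents a multi-level attention framework for inverse VQA that enhances object-level visual and semantic features with the answer cue, using dual attention when encoding the partial question and dynamic attention when generating the next question word.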

Findings

01

Achieves state-of-the-art performance on the VQA V1 dataset across multiple commonly used metrics

02

Demonstrates effectiveness of multi-level attention in inverse VQA

03

Improves question generation by using attention to enhance features with the answer cue

Abstract

In this paper, we propose a novel deep multi-level attention model to address inverse visual question answering. The proposed model generates regional visual and semantic features at the object level and then enhances them with the answer cue by using attention mechanisms. Two levels of multiple attentions are employed in the model, including the dual attention at the partial question encoding step and the dynamic attention at the next question word generation step. We evaluate the proposed model on the VQA V1 dataset. It demonstrates state-of-the-art performance in terms of multiple commonly used metrics.
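
The two attention levels described in the abstract can be illustrated with a minimal sketch. The abstract does not give the model's equations, so everything below is an assumption made for illustration: scaled dot-product scoring, the summed query vectors, the feature dimensions, and the names `attend` and `dual_attention` are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def attend(features, query):
    """Scaled dot-product attention over the rows of `features` (assumed scoring)."""
    scores = features @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    # The context vector is the attention-weighted average of the rows.
    return weights @ features, weights

def dual_attention(visual, semantic, answer, partial_question):
    """Attend separately over visual and semantic object-level features.

    The query mixes the answer cue with the partial-question encoding;
    summing them is an assumption made for this sketch.
    """
    query = answer + partial_question
    v_ctx, v_w = attend(visual, query)
    s_ctx, s_w = attend(semantic, query)
    return np.concatenate([v_ctx, s_ctx]), v_w, s_w

rng = np.random.default_rng(0)
num_objects, dim = 5, 8                          # hypothetical sizes
visual = rng.normal(size=(num_objects, dim))     # regional visual features
semantic = rng.normal(size=(num_objects, dim))   # regional semantic features
answer = rng.normal(size=dim)                    # encoded answer cue
partial_q = rng.normal(size=dim)                 # encoding of the words generated so far

# Level 1: dual attention at the partial question encoding step.
context, v_w, s_w = dual_attention(visual, semantic, answer, partial_q)

# Level 2: dynamic attention, recomputed at each word generation step
# because the decoder state (random here) changes from step to step.
step_weights = []
for _ in range(3):
    decoder_state = rng.normal(size=dim)
    _, w = attend(visual, decoder_state + answer)
    step_weights.append(w)
```

In this sketch the word "dynamic" means only that the attention weights are recomputed from a changing decoder state at every step. How the paper combines the attended features with the decoder is not stated in the abstract and is left out.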

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning