Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering

Hantao Huang; Tao Han; Wei Han; Deep Yap; Cheng-Ming Chiang

arXiv:2010.08708 · cs.CV · October 20, 2020 · 1 citation

Open Access

TL;DR

This paper introduces a fully attention-based VQA model with an answer-checking module that mimics human verification, achieving state-of-the-art accuracy on VQA-v2.0 with fewer parameters.

Contribution

It proposes a novel multi-modal fully attention network with an answer-checking module that enhances answer verification in VQA tasks.

Findings

01

Achieves 71.57% accuracy on VQA-v2.0 test-standard split.

02

Uses fewer parameters than previous state-of-the-art models.

03

Demonstrates improved answer verification through a unified attention mechanism.

Abstract

Visual Question Answering (VQA) is challenging because of its complex cross-modal relations, and it has received extensive attention from the research community. From the human perspective, answering a visual question means reading the question and then referring to the image to generate an answer. This answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, we propose an answer-checking module that performs unified attention over the joint answer, question, and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves a state-of-the-art accuracy of 71.57% on the VQA-v2.0 test-standard split while using fewer parameters.
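The answer-checking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head with identity query/key/value projections, toy two-dimensional vectors, and hypothetical function names (`unified_attention`, `check_answer`). The actual model uses transferred BERT layers whose details are not given here. The sketch only shows the core step: the candidate answer is concatenated with the question and image tokens, and one attention pass over the joint sequence updates the answer representation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def unified_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections (no learned weights),
    purely for illustration.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token in the joint sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def check_answer(answer, question, image):
    """Attend jointly over [answer] + question + image tokens and
    return the updated answer vector (position 0 of the output)."""
    joint = [answer] + question + image
    return unified_attention(joint)[0]

# Toy inputs (hypothetical values, not from the paper).
answer = [1.0, 0.0]
question = [[0.0, 1.0], [1.0, 1.0]]
image = [[0.5, 0.5]]
updated = check_answer(answer, question, image)
```

Because the answer token attends to question and image tokens in the same pass, its updated vector mixes in context from both modalities, which is the "answer in context" behavior the abstract describes.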

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Linear Layer · WordPiece · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay