Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering
Hantao Huang, Tao Han, Wei Han, Deep Yap, Cheng-Ming Chiang

TL;DR
This paper introduces a fully attention-based VQA model with an answer-checking module that mimics human verification, achieving state-of-the-art accuracy on VQA-v2.0 with fewer parameters.
Contribution
It proposes a novel multi-modal fully attention network with an answer-checking module that enhances answer verification in VQA tasks.
Findings
Achieves 71.57% accuracy on the VQA-v2.0 test-standard split.
Uses fewer parameters than previous state-of-the-art models.
Demonstrates improved answer verification through a unified attention mechanism.
Abstract
Visual Question Answering (VQA) is challenging due to its complex cross-modal relations and has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the question and then refer to the image to generate an answer. This answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, an answer-checking module is proposed to perform unified attention on the joint answer, question, and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves a state-of-the-art accuracy of 71.57% with fewer parameters on the VQA-v2.0 test-standard split.
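The answer-checking idea in the abstract (attend over the answer, question, and image representations together, then read back an updated answer) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head with identity projections and no learned weights, whereas the paper's module builds on multi-head attention and transferred BERT layers. The names softmax, self_attention, and answer_check are hypothetical and exist only for this illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real model would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token in the joint sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def answer_check(answer, question, image):
    """Hypothetical answer-checking step.

    Concatenates the answer vector with question and image tokens,
    runs one unified attention pass, and returns the updated answer.
    """
    joint = [answer] + question + image
    updated = self_attention(joint)
    return updated[0]  # position 0 holds the answer token


# Toy 2-dimensional features (made up for illustration).
answer = [1.0, 0.0]
question = [[0.5, 0.5], [0.0, 1.0]]
image = [[1.0, 1.0], [0.2, 0.1], [0.9, 0.0]]

refined = answer_check(answer, question, image)
print(refined)
```

Because the answer token attends to question and image tokens in the same pass, its updated vector mixes in context from both modalities, which is the sense in which the answer is checked "in context".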
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Linear Layer · WordPiece · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay
