Attention on Attention: Architectures for Visual Question Answering (VQA)

Jasdeep Singh; Vincent Ying; Alex Nutkiewicz

arXiv:1803.07724·cs.CL·March 22, 2018·21 cites

TL;DR

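This paper introduces thirteen new attention mechanisms and a simplified classifier for VQA, built on the model that placed first in the VQA Challenge. After 300 GPU hours of hyperparameter and architecture search, it reports an evaluation score of 64.78%, against 63.15% for the previous state-of-the-art single model.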

Contribution

It presents thirteen new attention mechanisms and a simplified classifier that together improve on the previous best single-model VQA score.

Findings

01. Achieved an evaluation score of 64.78% on VQA.

02. Outperformed the previous state-of-the-art single model's validation score of 63.15%.

03. Arrived at these results through 300 GPU hours of hyperparameter and architecture searches over the new attention mechanisms.

Abstract

Visual Question Answering (VQA) is an increasingly popular topic in deep learning research, requiring coordination of natural language processing and computer vision modules into a single architecture. We build upon the model which placed first in the VQA Challenge by developing thirteen new attention mechanisms and introducing a simplified classifier. We performed 300 GPU hours of extensive hyperparameter and architecture searches and were able to achieve an evaluation score of 64.78%, outperforming the existing state-of-the-art single model's validation score of 63.15%.
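As a rough illustration of what an attention mechanism does in a VQA model, the sketch below weights a handful of image-region feature vectors by their similarity to a question vector and returns the weighted sum. It is a generic soft-attention example, not one of the thirteen mechanisms from the paper: the function names `softmax` and `attend`, the dot-product scoring, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec, region_feats):
    """Generic soft attention (hypothetical helper, not the paper's method).

    Scores each image region by its dot product with the question vector,
    normalizes the scores with softmax, and returns the attention weights
    together with the weighted sum of the region features.
    """
    scores = [
        sum(q * r for q, r in zip(question_vec, region))
        for region in region_feats
    ]
    weights = softmax(scores)
    dim = len(region_feats[0])
    attended = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, attended


# Toy inputs (assumed): a 2-d question embedding and three 2-d region features.
question = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, attended = attend(question, regions)
```

Here the first region aligns most closely with the question, so it receives the largest weight. In a full model the attended vector would be passed, along with the question encoding, to a classifier over candidate answers.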

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning