Attention on Attention: Architectures for Visual Question Answering (VQA)
Jasdeep Singh, Vincent Ying, Alex Nutkiewicz

TL;DR
This paper introduces thirteen new attention mechanisms and a simplified classifier for VQA, reaching an evaluation score of 64.78% after about 300 GPU hours of hyperparameter and architecture search.
Contribution
Building on the model that placed first in the VQA Challenge, the paper presents thirteen new attention mechanisms and a simplified classifier.
Findings
Achieved an evaluation score of 64.78%.
Outperformed the existing state-of-the-art single model's validation score of 63.15%, a margin of 1.63 percentage points.
Evaluated the new attention architectures through 300 GPU hours of hyperparameter and architecture searches.
Abstract
Visual Question Answering (VQA) is an increasingly popular topic in deep learning research, requiring coordination of natural language processing and computer vision modules into a single architecture. We build upon the model which placed first in the VQA Challenge by developing thirteen new attention mechanisms and introducing a simplified classifier. We performed 300 GPU hours of extensive hyperparameter and architecture searches and were able to achieve an evaluation score of 64.78%, outperforming the existing state-of-the-art single model's validation score of 63.15%.
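The abstract does not describe how any of the thirteen attention mechanisms work, so the sketch below is not one of them. It is a generic question-guided soft attention over image region features, included only to illustrate the kind of component that VQA models of this family combine with a classifier. The function name `soft_attention`, the weight matrices `w_img`, `w_q`, `w_score`, and all dimensions are assumptions made for illustration.

```python
import numpy as np

def soft_attention(image_feats, question_vec, w_img, w_q, w_score):
    """Generic question-guided soft attention (illustrative only;
    NOT one of the paper's thirteen mechanisms).

    image_feats:  (K, D_v) features for K image regions
    question_vec: (D_q,)   question embedding
    w_img (D_v, H), w_q (D_q, H), w_score (H,): hypothetical projections
    """
    # Project both modalities into a shared hidden space and combine them.
    # The question projection has shape (H,) and broadcasts over the K regions.
    joint = np.tanh(image_feats @ w_img + question_vec @ w_q)  # (K, H)

    # One scalar relevance score per image region.
    scores = joint @ w_score  # (K,)

    # Numerically stable softmax turns scores into attention weights.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()  # (K,), sums to 1

    # Attention-weighted sum of the region features.
    attended = weights @ image_feats  # (D_v,)
    return weights, attended


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, D_v, D_q, H = 36, 8, 6, 4  # arbitrary toy sizes
    weights, attended = soft_attention(
        rng.normal(size=(K, D_v)),
        rng.normal(size=D_q),
        rng.normal(size=(D_v, H)),
        rng.normal(size=(D_q, H)),
        rng.normal(size=H),
    )
    print(weights.shape, attended.shape)
```

In a full model, the attended vector would typically be fused with the question embedding and passed to an answer classifier; that stage is omitted here because the source gives no detail on it.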
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
