Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

Damien Teney; Peter Anderson; Xiaodong He; Anton van den Hengel

arXiv:1708.02711 · cs.CV · August 10, 2017

TL;DR

This paper details a high-performing, simple model for visual question answering (VQA) that won the 2017 VQA Challenge, highlighting architecture choices and hyperparameters that significantly improve performance.

Contribution

The paper introduces a set of effective tips and tricks for VQA model design, derived from extensive experimentation, to guide future research and development.

Findings

01 Sigmoid outputs improve accuracy

02 Image features from bottom-up attention enhance performance

03 Large mini-batches and smart shuffling are beneficial
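
To make the first finding concrete, the sketch below contrasts sigmoid outputs with a softmax over the same candidate-answer scores. It is only an illustration: the logits are made-up values, and the helper names are not taken from the paper. The point it shows is that a sigmoid scores each candidate answer independently, so two plausible answers can both receive a high score, whereas a softmax forces them to share one unit of probability.

```python
import math


def sigmoid(x):
    """Logistic function, applied to one answer score at a time."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over all candidate answers."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate answers (illustrative values only).
# The first two answers are equally plausible; the third is unlikely.
logits = [2.0, 2.0, -3.0]

sigmoid_scores = [sigmoid(v) for v in logits]  # each score is independent
softmax_scores = softmax(logits)               # scores compete and sum to 1

print("sigmoid:", [round(s, 3) for s in sigmoid_scores])
print("softmax:", [round(s, 3) for s in softmax_scores])
```

With sigmoid outputs, both plausible answers stay well above 0.5; under softmax, neither can exceed 0.5 because they split the probability mass.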

Abstract

This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge. VQA is a task of significant importance for research in artificial intelligence, given its multimodal nature, clear evaluation protocol, and potential real-world applications. The performance of deep neural networks for VQA depends heavily on the choice of architecture and hyperparameters. To help further research in the area, we describe in detail our high-performing, though relatively simple, model. Through a massive exploration of architectures and hyperparameters representing more than 3,000 GPU-hours, we identified tips and tricks that lead to its success, namely: sigmoid outputs, soft training targets, image features from bottom-up attention, gated tanh activations, output embeddings initialized using GloVe and Google Images, large mini-batches, and…
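
The abstract names gated tanh activations and soft training targets without defining them on this page. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the commonly used gated form y = tanh(Wx + b) * sigmoid(W'x + b'), and it assumes soft targets are per-answer scores in [0, 1] trained with a binary cross-entropy loss on sigmoid outputs. All function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Compute Wx + b for a list-of-rows matrix W and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def gated_tanh(x, W, b, W_gate, b_gate):
    """Assumed gated tanh layer: tanh(Wx + b) scaled elementwise by a sigmoid gate."""
    candidate = [math.tanh(v) for v in affine(W, b, x)]
    gate = [sigmoid(v) for v in affine(W_gate, b_gate, x)]
    return [c * g for c, g in zip(candidate, gate)]


def soft_target_bce(logits, targets):
    """Binary cross-entropy summed over answers, with targets anywhere in [0, 1]."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for z, s in zip(logits, targets):
        p = sigmoid(z)
        loss -= s * math.log(p + eps) + (1.0 - s) * math.log(1.0 - p + eps)
    return loss


# Illustrative values only: identity weights and a zero gate (gate = 0.5).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros_matrix = [[0.0, 0.0], [0.0, 0.0]]
zeros = [0.0, 0.0]
hidden = gated_tanh([1.0, -1.0], identity, zeros, zeros_matrix, zeros)

# A soft target of 0.5 with a logit of 0 gives a loss of log(2).
loss = soft_target_bce([0.0], [0.5])
print(hidden, loss)
```

The gate lets the layer scale each tanh activation between 0 and 1 depending on the input, and the soft targets let an answer that is only partly correct contribute a partial training signal instead of a hard 0 or 1.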

Taxonomy

Methods: GloVe Embeddings