Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering

Vahid Kazemi; Ali Elqursh

arXiv:1704.03162·cs.CV·April 13, 2017·149 cites

TL;DR

This paper introduces a simple yet effective baseline model for visual question answering that sets a new state of the art on the VQA 1.0 and VQA 2.0 benchmarks; the VQA 1.0 result uses no additional data.

Contribution

The paper proposes an architecturally simple model with relatively few trainable parameters that outperforms the best previously reported results by 0.4% on VQA 1.0 test-standard and by 0.5% on the VQA 2.0 validation set.

Findings

01

Achieves 64.6% accuracy on VQA 1.0 test-standard

02

Scores 59.7% on VQA 2.0 validation set

03

Model is architecturally simple and relatively small in terms of trainable parameters

Abstract

This paper presents a new baseline for the visual question answering task. Given an image and a question in natural language, our model produces accurate answers according to the content of the image. Our model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both the unbalanced and balanced VQA benchmarks. On the VQA 1.0 open-ended challenge, our model achieves 64.6% accuracy on the test-standard set without using additional data, an improvement of 0.4% over the state of the art, and on the newly released VQA 2.0, our model scores 59.7% on the validation set, outperforming the best previously reported results by 0.5%. The results presented in this paper are especially interesting because very similar models have been tried before, but significantly lower performance was reported. In light of the new results we hope to see more meaningful…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Average Pooling · Residual Connection · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling