Simple Baseline for Visual Question Answering

Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus

arXiv:1512.02167 · cs.CV · December 16, 2015 · 292 citations

TL;DR

This paper introduces a straightforward bag-of-words baseline for visual question answering that combines question and image features, achieving performance comparable to more complex models on the VQA dataset.

Contribution

It presents a simple, effective baseline model for VQA that challenges the necessity of complex architectures, with open-source code and an interactive demo.

Findings

01. The baseline achieves results comparable to recent RNN-based models.

02. Simple concatenation of question and image features is surprisingly effective.

03. Open-source code and an interactive demo facilitate further research.

Abstract

We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.
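
The pipeline in the abstract (bag-of-words question features concatenated with CNN image features, then an answer prediction) can be illustrated with a minimal sketch. This is not the authors' released code: the vocabulary, the answer set, the feature dimension, the random weights, and the single linear layer followed by a softmax are all assumptions made for illustration, and the image feature vector is random numbers standing in for the output of a real CNN.

```python
import math
import random

# Hypothetical toy vocabulary and answer set (assumptions, not from the paper).
VOCAB = ["what", "color", "is", "the", "cat", "how", "many", "dogs"]
ANSWERS = ["black", "white", "two", "yes"]
WORD_TO_INDEX = {word: i for i, word in enumerate(VOCAB)}


def bag_of_words(question):
    """Count vocabulary words in the question; word order is discarded."""
    counts = [0.0] * len(VOCAB)
    for token in question.lower().replace("?", "").split():
        if token in WORD_TO_INDEX:
            counts[WORD_TO_INDEX[token]] += 1.0
    return counts


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(question, image_features, weights, bias):
    """Concatenate question and image features, then apply a linear softmax layer."""
    joint = bag_of_words(question) + list(image_features)  # list concatenation
    scores = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


random.seed(0)
IMAGE_DIM = 16  # assumed size; a real CNN feature vector is much larger

# Stand-in for CNN features of one image (random values, for illustration only).
image_features = [random.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]

# Untrained, randomly initialised classifier parameters.
weights = [
    [random.gauss(0.0, 0.1) for _ in range(len(VOCAB) + IMAGE_DIM)]
    for _ in ANSWERS
]
bias = [0.0] * len(ANSWERS)

probs = predict("What color is the cat?", image_features, weights, bias)
best_answer = ANSWERS[probs.index(max(probs))]
```

Because the weights are untrained, `best_answer` is arbitrary here. The sketch only shows the data flow: the question contributes a fixed-length count vector regardless of word order, and the classifier sees it side by side with the image features.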

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning