Simple Baseline for Visual Question Answering
Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus

TL;DR
This paper introduces a straightforward bag-of-words baseline for visual question answering that combines question and image features, achieving performance comparable to more complex models on the VQA dataset.
Contribution
It presents a simple, effective baseline model for VQA that challenges the necessity of complex architectures, with open-source code and an interactive demo.
Findings
The baseline achieves results comparable to recent RNN-based models.
Simple concatenation of features is surprisingly effective.
Open-source code and demo facilitate further research.
Abstract
We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.
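The pipeline described in the abstract can be illustrated with a minimal sketch: a bag-of-words vector for the question is concatenated with an image feature vector, and a linear layer followed by a softmax scores a fixed set of candidate answers. Everything concrete here is an assumption made for illustration, not the authors' released code: the toy vocabulary, the answer list, the three-dimensional placeholder standing in for CNN features, the untrained random weights, and the function names `bag_of_words`, `softmax`, and `predict`.

```python
import math
import random


def bag_of_words(question, vocab):
    """Count how often each vocabulary word occurs in the question."""
    counts = [0.0] * len(vocab)
    for token in question.lower().replace("?", " ").split():
        if token in vocab:
            counts[vocab[token]] += 1.0
    return counts


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(question, image_features, vocab, weights, bias, answers):
    """Concatenate question and image features, then classify over answers."""
    features = bag_of_words(question, vocab) + list(image_features)
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs


# Toy setup (all values hypothetical).
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
answers = ["black", "white", "yes"]
image_features = [0.2, 0.7, 0.1]  # placeholder for real CNN features

rng = random.Random(0)
dim = len(vocab) + len(image_features)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in answers]
bias = [0.0] * len(answers)

answer, probs = predict(
    "What color is the cat?", image_features, vocab, weights, bias, answers
)
print(answer, probs)
```

Because the weights are random, the predicted answer is meaningless; in a real system they would be learned from question-image-answer triples. The sketch only shows how little machinery the concatenation approach needs compared with a recurrent question encoder.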
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
