Zero-Shot Visual Question Answering
Damien Teney, Anton van den Hengel

TL;DR
This paper introduces a new evaluation protocol for Zero-Shot Visual Question Answering (VQA) that exposes the limitations of current methods. It also proposes strategies based on semantic embeddings and retrieval, and reports extensive experiments intended to serve as baselines.
Contribution
It proposes a novel evaluation protocol for Zero-Shot VQA and evaluates multiple strategies, establishing new baselines and revealing practical deficiencies in existing methods.
Findings
Current VQA methods struggle with Zero-Shot VQA, that is, with questions beyond the scope of their training questions.
Semantic embeddings improve Zero-Shot VQA performance.
The proposed strategies outperform previous approaches in baseline experiments.
Abstract
Part of the appeal of Visual Question Answering (VQA) is its promise to answer new questions about previously unseen images. Most current methods demand training questions that illustrate every possible concept, and will therefore never achieve this capability, since the volume of required training data would be prohibitive. Answering general questions about images requires methods capable of Zero-Shot VQA, that is, methods able to answer questions beyond the scope of the training questions. We propose a new evaluation protocol for VQA methods which measures their ability to perform Zero-Shot VQA, and in doing so highlights significant practical deficiencies of current approaches, some of which are masked by the biases in current datasets. We propose and evaluate several strategies for achieving Zero-Shot VQA, including methods based on pretrained word embeddings, object classifiers…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
