WeaQA: Weak Supervision via Captions for Visual Question Answering

Pratyay Banerjee; Tejas Gokhale; Yezhou Yang; Chitta Baral

arXiv:2012.02356·cs.CV·May 31, 2021

TL;DR

This paper introduces WeaQA, a weakly-supervised approach for visual question answering that relies on captions instead of human-annotated Q-A pairs, improving generalization and reducing annotation costs.

Contribution

The paper proposes a method to train VQA models using only images and their captions. It generates synthetic Q-A pairs procedurally from the captions and uses spatial-pyramid image patches in place of dense object bounding box annotations, reducing reliance on costly human annotation.
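To make the idea of procedurally generating Q-A pairs from a caption concrete, here is a minimal template-based sketch in Python. It is an illustration only: the word lists `COLORS` and `OBJECTS`, the function `generate_qa`, and the two question templates are assumptions made for this example and are not taken from the paper, whose actual generation procedure is not described on this page.

```python
import re

# Hypothetical vocabularies for illustration; not taken from the paper.
COLORS = {"red", "blue", "green", "yellow", "white", "black", "brown"}
OBJECTS = {"man", "woman", "dog", "cat", "horse", "ball", "car", "beach"}


def tokenize(caption):
    """Lowercase the caption and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", caption.lower())


def generate_qa(caption):
    """Return a list of (question, answer) pairs derived from one caption.

    Two assumed templates are applied:
      1. "<color> <object>" yields "What color is the <object>?" -> <color>
      2. "<object>" yields "Is there a <object> in the image?" -> "yes"
    """
    tokens = tokenize(caption)
    pairs = []
    for i, tok in enumerate(tokens):
        # Template 1: a color word directly followed by a known object.
        if tok in COLORS and i + 1 < len(tokens) and tokens[i + 1] in OBJECTS:
            pairs.append((f"What color is the {tokens[i + 1]}?", tok))
        # Template 2: any known object produces an existence question.
        if tok in OBJECTS:
            pairs.append((f"Is there a {tok} in the image?", "yes"))
    return pairs


if __name__ == "__main__":
    for question, answer in generate_qa("A brown dog chases a red ball."):
        print(question, "->", answer)
```

The point of the sketch is that the supervision signal comes entirely from the caption text, so no human-written question or answer is needed for the image.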

Findings

01. Effective on three VQA benchmarks

02. Improves performance on VQA-CP challenge

03. Reduces need for human-annotated Q-A datasets

Abstract

Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models. Our experiments on…
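The spatial-pyramid patches mentioned in the abstract can be pictured as a set of fixed grids laid over the image at several resolutions. The Python sketch below computes the patch coordinates for such a pyramid. The function name `pyramid_boxes` and the default grid sizes `(1, 2, 3)` are assumptions chosen for illustration; this page does not state which grid configuration the paper uses.

```python
def pyramid_boxes(width, height, grid_sizes=(1, 2, 3)):
    """Return patch boxes (x0, y0, x1, y1) for a spatial pyramid.

    For each grid size g, the image is divided into g x g equal cells.
    The grid sizes are an assumed example configuration.
    """
    boxes = []
    for g in grid_sizes:
        for row in range(g):
            for col in range(g):
                # Integer division keeps cells contiguous and covers the
                # whole image even when width or height is not divisible by g.
                x0 = col * width // g
                y0 = row * height // g
                x1 = (col + 1) * width // g
                y1 = (row + 1) * height // g
                boxes.append((x0, y0, x1, y1))
    return boxes


if __name__ == "__main__":
    for box in pyramid_boxes(224, 224, grid_sizes=(1, 2)):
        print(box)
```

Unlike object bounding boxes, these patches depend only on the image size, so they require no detector and no box annotations. Each patch would then be cropped and passed to an image feature extractor.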
