WeaQA: Weak Supervision via Captions for Visual Question Answering
Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral

TL;DR
This paper introduces WeaQA, a weakly-supervised approach for visual question answering that relies on captions instead of human-annotated Q-A pairs, improving generalization and reducing annotation costs.
Contribution
The paper proposes a novel method to train VQA models using only images and captions, generating synthetic Q-A pairs and utilizing spatial image patches, reducing reliance on costly annotations.
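To make the idea of generating Q-A pairs procedurally from captions concrete, here is a toy sketch. It is not the paper's generation procedure: the `COLORS` vocabulary, the `color_questions` helper, and the rule that the word following a color is the object it describes are all assumptions made purely for illustration.

```python
# Toy template-based Q-A generation from a caption (illustrative only).
# COLORS and the "next word is the object" rule are assumptions, not
# taken from the paper.
COLORS = {"red", "blue", "green", "yellow", "black", "white", "brown"}


def color_questions(caption):
    """Return (question, answer) pairs for color words found in a caption."""
    words = caption.lower().rstrip(".").split()
    pairs = []
    for i in range(len(words) - 1):
        if words[i] in COLORS:
            # Naive assumption: the word after a color is the described object.
            noun = words[i + 1]
            pairs.append((f"What color is the {noun}?", words[i]))
    return pairs


if __name__ == "__main__":
    for question, answer in color_questions("A man in a red shirt rides a brown horse."):
        print(question, "->", answer)
```

A real pipeline would need proper linguistic analysis of the caption; the point here is only that a caption alone can yield question-answer supervision without human-written Q-A pairs.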
Findings
Effective on three VQA benchmarks
Improves performance on VQA-CP challenge
Reduces need for human-annotated Q-A datasets
Abstract
Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models. Our experiments on…
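The spatial-pyramid patches mentioned in the abstract can be pictured as fixed grids laid over the image at several resolutions, so no object detector or bounding-box annotation is needed. The sketch below only computes the patch coordinates; the function name `spatial_pyramid_boxes` and the grid sizes `(1, 2, 3)` are assumptions for illustration, not the configuration used in the paper.

```python
# Minimal sketch of spatial-pyramid patch coordinates (illustrative only).
# The grid sizes below are assumed, not taken from the paper.
def spatial_pyramid_boxes(width, height, grid_sizes=(1, 2, 3)):
    """Return (left, top, right, bottom) boxes for each k x k grid."""
    boxes = []
    for k in grid_sizes:
        for row in range(k):
            for col in range(k):
                # Integer division keeps cells contiguous and covers the image.
                left = col * width // k
                right = (col + 1) * width // k
                top = row * height // k
                bottom = (row + 1) * height // k
                boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    for box in spatial_pyramid_boxes(224, 224):
        print(box)
```

With grid sizes 1, 2, and 3 this yields 1 + 4 + 9 = 14 patches per image. Each patch could then be cropped and encoded by an image backbone in place of detector-proposed regions.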
