Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering
Man Luo, Yankai Zeng, Pratyay Banerjee, Chitta Baral

TL;DR
This paper introduces a weakly-supervised visual retriever-reader pipeline for knowledge-based VQA, utilizing a newly collected universal knowledge base to improve answer accuracy on the OK-VQA dataset.
Contribution
It proposes a novel weakly-supervised retriever-reader framework and a universal knowledge base for fair comparison and improved performance in knowledge-based VQA.
Findings
A strong retriever significantly boosts reader performance.
The proposed methods outperform baselines on OK-VQA.
The universal knowledge base enables fairer model comparisons.
Abstract
Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. One dataset that is widely used to evaluate knowledge-based VQA is OK-VQA, but it lacks a gold-standard knowledge corpus for retrieval. Existing work leverages different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because the knowledge bases vary, it is hard to compare models' performance fairly. To address this issue, we collect a natural language knowledge base that can be used for any VQA system. Moreover, we propose a Visual Retriever-Reader pipeline to approach knowledge-based VQA. The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on the given knowledge. We introduce various ways to retrieve knowledge using text and images and two reader styles:…
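The retrieve-then-read flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words cosine retriever, the use of a caption string to stand in for the image, the candidate-scoring reader, and all function names (`retrieve`, `read`) are assumptions made purely for illustration. The paper's own retrievers and its two reader styles are not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, image_caption, corpus, k=1):
    # Hypothetical retriever: the image is represented by a caption
    # (an assumption), and passages are ranked by word overlap.
    query = Counter(tokenize(question + " " + image_caption))
    scored = [(cosine(query, Counter(tokenize(p))), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def read(passages, candidates):
    # Hypothetical reader: pick the candidate answer that occurs
    # most often in the retrieved passages.
    counts = Counter(t for p in passages for t in tokenize(p))
    return max(candidates, key=lambda c: counts[c])


corpus = [
    "A fire hydrant supplies water to firefighters.",
    "Bananas are rich in potassium.",
    "Frisbees are thrown in parks by dogs and people.",
]
knowledge = retrieve(
    "What does this object supply?",
    "a red fire hydrant on a sidewalk",
    corpus,
    k=1,
)
print(read(knowledge, ["water", "potassium", "frisbee"]))  # prints "water"
```

Even in this toy setting, the reader can only succeed when the retriever returns the right passage, which is the dependency the Findings section points to.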
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
