A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA
Yangyang Guo, Liqiang Nie, Yongkang Wong, Yibing Liu, Zhiyong Cheng, and Mohan Kankanhalli

TL;DR
This paper introduces a unified end-to-end retriever-reader framework for knowledge-based VQA that leverages multi-modal implicit knowledge and a novel pseudo-label scheme to improve answer accuracy.
Contribution
It proposes a novel end-to-end framework that utilizes multi-modal implicit knowledge and pseudo-labeling to enhance knowledge retrieval and reasoning in VQA.
Findings
Outperforms existing baselines on benchmark datasets.
Effectively mitigates noise and error propagation in knowledge retrieval.
Provides new insights into multi-modal implicit knowledge for VQA.
Abstract
Knowledge-based Visual Question Answering (VQA) expects models to rely on external knowledge for robust answer prediction. Significant as this task is, this paper identifies several leading factors impeding the advancement of current state-of-the-art methods. On the one hand, methods that exploit explicit knowledge treat the knowledge as a complement to a coarsely trained VQA model. Despite their effectiveness, these approaches often suffer from noise incorporation and error propagation. On the other hand, multi-modal implicit knowledge for knowledge-based VQA remains largely unexplored. This work presents a unified end-to-end retriever-reader framework for knowledge-based VQA. In particular, we shed light on the multi-modal implicit knowledge from vision-language pre-training models to mine its potential in knowledge reasoning. As…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
