Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao, Qunbo Wang, Longteng Guo, Jie Jiang, Jing Liu

TL;DR
This paper introduces a self-bootstrapped visual-language framework that improves knowledge selection and question answering in open-domain VQA by iteratively refining key knowledge retrieval and answer prediction, achieving state-of-the-art results.
Contribution
The proposed method uses a visual-language model to select relevant knowledge and iteratively fine-tunes both the knowledge selection and answering modules, improving open-domain VQA performance.
Findings
Achieves 62.83% accuracy on the OK-VQA benchmark.
Significantly outperforms baseline models.
Demonstrates effectiveness of self-bootstrapping in knowledge selection.
Abstract
While large visual-language models (LVLMs) have shown promising results on traditional visual question answering benchmarks, it is still challenging for them to answer complex VQA problems that require diverse world knowledge. Motivated by research on retrieval-augmented generation in natural language processing, we use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions. However, DPR conducts retrieval in natural language space, which may not ensure comprehensive acquisition of image information. Thus, the retrieved knowledge is not always conducive to answering the question, which affects the performance of the overall system. To address this issue, we propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions. The framework consists of two modules:…
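As a rough illustration of the retrieval step the abstract refers to: DPR-style retrieval scores each passage by the inner product between a question embedding and a passage embedding, then keeps the top-k passages. The sketch below assumes the embeddings are already computed; the function names and toy vectors are hypothetical and not taken from the paper. Because the score depends only on text embeddings, nothing in it looks at the image, which is the limitation the abstract points to.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    question_vec: Sequence[float],
    passage_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages."""
    scored = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    # Highest inner product first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-made embeddings (hypothetical values, not from the paper).
question = [1.0, 0.0, 1.0]
passages = [
    [0.9, 0.1, 0.8],  # closely related passage
    [0.0, 1.0, 0.0],  # unrelated passage
    [0.5, 0.5, 0.5],  # loosely related passage
]
top = retrieve_top_k(question, passages, k=2)
print([index for index, _ in top])
```

In a real system the vectors would come from trained question and passage encoders and the search would run over a large index; the ranking rule itself stays the same.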
Taxonomy
Topics: Fault Detection and Control Systems
