Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

Dongze Hao; Qunbo Wang; Longteng Guo; Jie Jiang; Jing Liu

arXiv:2404.13947 · cs.CV · October 10, 2024


TL;DR

This paper introduces a self-bootstrapped visual-language framework that improves knowledge selection and question answering in open-domain VQA by iteratively refining key knowledge retrieval and answer prediction, achieving state-of-the-art results.

Contribution

The proposed method uses a visual-language model to select relevant knowledge and iteratively fine-tunes both the knowledge selection and answering modules, improving open-domain VQA performance.

Findings

01. Achieved 62.83% accuracy on the OK-VQA benchmark.

02. Significantly outperforms baseline models.

03. Demonstrates the effectiveness of self-bootstrapping in knowledge selection.

Abstract

While large visual-language models (LVLMs) have shown promising results on traditional visual question answering (VQA) benchmarks, it is still challenging for them to answer complex VQA problems that require diverse world knowledge. Motivated by research on retrieval-augmented generation in natural language processing, we use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions. However, DPR conducts retrieval in the natural language space, which may not ensure comprehensive acquisition of image information. Thus, the retrieved knowledge is not truly conducive to answering the question, affecting the performance of the overall system. To address this issue, we propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions. The framework consists of two modules:…
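
The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. Dense retrieval in the DPR style scores each passage by the inner product between a query embedding and a passage embedding and keeps the top-k passages. The function names and the toy two-dimensional vectors below are hypothetical and chosen only for illustration; this is not the authors' code, and real embeddings would come from trained encoders.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages.

    Scores are inner products, the similarity used in dense passage retrieval.
    """
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    # Sort by score, highest first; Python's sort is stable, so ties keep index order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values, not produced by any real encoder).
toy_query = [1.0, 0.0]
toy_passages = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
top = retrieve_top_k(toy_query, toy_passages, k=2)
```

In the framework described above, the passages returned by such a retriever would then be passed to the visual-language model, which selects the key knowledge before answering.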


Code & Models

Repositories

haodongze/self-ksel-qans (PyTorch, official)

Videos

Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

Taxonomy

Topics: Fault Detection and Control Systems