Prompting Vision Language Model with Knowledge from Large Language Model for Knowledge-Based VQA

Yang Zhou; Pengfei Cao; Yubo Chen; Kang Liu; Jun Zhao

arXiv:2308.15851 · cs.MM · August 31, 2023 · 1 cites

Open Access

TL;DR

This paper introduces PROOFREAD, a novel framework that combines large language models and vision-language models to improve knowledge-based visual question answering by explicitly obtaining and filtering knowledge.

Contribution

The proposed method explicitly leverages LLMs for knowledge extraction and introduces a knowledge perceiver to enhance VQA accuracy, outperforming state-of-the-art approaches.

Findings

01 Outperforms all SOTA methods on the A-OKVQA dataset

02 Achieves good performance on the OKVQA dataset

03 Effectively filters harmful knowledge for accurate answers

Abstract

Knowledge-based visual question answering is a challenging task that has attracted wide attention. Previous methods adopt the implicit knowledge in large language models (LLMs) to achieve excellent results, but we argue that existing methods may suffer from a biased understanding of the image and insufficient knowledge to solve the problem. In this paper, we propose PROOFREAD (PROmpting vision language model with knOwledge From laRgE lAnguage moDel), a novel, lightweight and efficient knowledge-based VQA framework, which makes the vision language model and the large language model cooperate to give full play to their respective strengths and bootstrap each other. In detail, our proposed method uses the LLM to obtain knowledge explicitly, uses the vision language model, which can see the image, to get the knowledge answer, and introduces a knowledge perceiver to filter out knowledge that is harmful for…
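The three roles named in the abstract (an LLM that supplies knowledge, a knowledge perceiver that filters it, and a vision language model that answers) can be pictured with a minimal sketch. It is not the authors' implementation: the function name, the three callables, their signatures, and the order of the steps are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List


def answer_with_filtered_knowledge(
    question: str,
    image: object,
    generate_knowledge: Callable[[str], List[str]],
    is_helpful: Callable[[object, str, str], bool],
    answer_with_vlm: Callable[[object, str, List[str]], str],
) -> str:
    """Illustrative pipeline only; every interface here is an assumption.

    generate_knowledge stands in for the LLM,
    is_helpful stands in for the knowledge perceiver,
    answer_with_vlm stands in for the vision language model.
    """
    # Step 1: ask the (hypothetical) LLM for explicit knowledge statements.
    candidates = generate_knowledge(question)

    # Step 2: keep only the statements the (hypothetical) perceiver accepts.
    kept = [k for k in candidates if is_helpful(image, question, k)]

    # Step 3: let the (hypothetical) VLM answer using image, question, and kept knowledge.
    return answer_with_vlm(image, question, kept)
```

Passing the three components as plain callables keeps the sketch independent of any particular model or library; a real system would replace them with actual model calls.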

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques