Prompting Vision Language Model with Knowledge from Large Language Model for Knowledge-Based VQA
Yang Zhou, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

TL;DR
This paper introduces PROOFREAD, a novel framework that combines large language models and vision-language models to improve knowledge-based visual question answering by explicitly obtaining and filtering knowledge.
Contribution
The proposed method explicitly leverages LLMs for knowledge extraction and introduces a knowledge perceiver to enhance VQA accuracy, outperforming state-of-the-art approaches.
Findings
Outperforms all SOTA methods on the A-OKVQA dataset
Achieves good performance on the OKVQA dataset
Effectively filters harmful knowledge for accurate answers
Abstract
Knowledge-based visual question answering is a challenging task that has attracted wide attention. Previous methods adopt the implicit knowledge in large language models (LLMs) to achieve excellent results, but we argue that existing methods may suffer from a biased understanding of the image and from insufficient knowledge to solve the problem. In this paper, we propose PROOFREAD (PROmpting vision language model with knOwledge From laRgE lAnguage moDel), a novel, lightweight and efficient knowledge-based VQA framework, which makes the vision language model and the large language model cooperate so that each plays to its strengths and the two bootstrap each other. In detail, our proposed method uses the LLM to obtain knowledge explicitly, uses the vision language model, which can see the image, to get the knowledge answer, and introduces a knowledge perceiver to filter out knowledge that is harmful for…
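The three roles named in the abstract (an LLM that supplies knowledge, a filter that discards harmful knowledge, and a vision language model that sees the image) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the callable interfaces, the order of the steps, and the toy stand-ins are hypothetical and are not taken from the paper, whose actual design is not described in this summary.

```python
from typing import Callable, List


def answer_with_filtered_knowledge(
    question: str,
    image: object,
    propose_knowledge: Callable[[str], List[str]],
    is_helpful: Callable[[str, object, str], bool],
    vlm_answer: Callable[[str, object, List[str]], str],
) -> str:
    """Hypothetical pipeline: propose knowledge, filter it, then answer."""
    # Step 1 (assumed): a language model proposes candidate knowledge statements.
    candidates = propose_knowledge(question)
    # Step 2 (assumed): a filter keeps only the statements judged helpful.
    kept = [k for k in candidates if is_helpful(question, image, k)]
    # Step 3 (assumed): a model with access to the image answers using what was kept.
    return vlm_answer(question, image, kept)


# Toy stand-ins so the sketch runs without any model; none of these are real APIs.
def toy_propose(question: str) -> List[str]:
    return ["Bananas are yellow.", "Bananas are blue."]


def toy_is_helpful(question: str, image: object, knowledge: str) -> bool:
    return "blue" not in knowledge


def toy_vlm_answer(question: str, image: object, knowledge: List[str]) -> str:
    return f"{len(knowledge)} fact(s) used"


result = answer_with_filtered_knowledge(
    "What color is the fruit?", None, toy_propose, toy_is_helpful, toy_vlm_answer
)
print(result)
```

Passing the three components as callables keeps the sketch independent of any particular model library; real components would replace the toy functions.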
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
