Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
Zhou Yu, Xuecheng Ouyang, Zhenwei Shao, Meng Wang, Jun Yu

TL;DR
Prophet enhances knowledge-based visual question answering by prompting large language models with answer heuristics derived from a trained VQA model, and is reported to outperform state-of-the-art methods on four datasets.
Contribution
Introducing Prophet, a flexible framework that combines answer heuristics from a VQA model with LLM prompting, advancing knowledge-based VQA performance.
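As a rough illustration of the idea of pairing answer heuristics with LLM prompting, the sketch below formats candidate answers from a VQA model into a text prompt. It is a minimal sketch and not the authors' implementation: the function names, the prompt wording, the `context` field (a textual description of the image), the use of confidence scores, and the `top_k` cutoff are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical helper: a candidate is an (answer, confidence) pair assumed
# to come from a trained VQA model. Nothing here is taken from the paper's code.
Candidate = Tuple[str, float]


def format_candidates(candidates: List[Candidate], top_k: int = 3) -> str:
    """Keep the top_k candidates by confidence and render them as text."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    return ", ".join(f"{answer} ({score:.2f})" for answer, score in ranked)


def build_prompt(context: str, question: str,
                 candidates: List[Candidate], top_k: int = 3) -> str:
    """Assemble an illustrative prompt; the template wording is an assumption."""
    lines = [
        "Answer the question using the context and the candidate answers.",
        f"Context: {context}",
        f"Question: {question}",
        f"Candidates: {format_candidates(candidates, top_k)}",
        "Answer:",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example values, for demonstration only.
    demo = [("horse", 0.12), ("zebra", 0.81), ("donkey", 0.05)]
    print(build_prompt("a striped animal standing in a field",
                       "Which continent is this animal native to?",
                       demo, top_k=2))
```

The resulting string would then be sent to an LLM of choice; that call is left out because the summary does not specify which model or API is used.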
Findings
Outperforms state-of-the-art on four datasets
Effective with various VQA models and LLMs
Can be integrated with multimodal models for further gains
Abstract
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and thereby restricts the performance of their models. Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the "blind" LLM, as the provided textual input is insufficient to depict the visual information required to answer the question. In this paper, we present Prophet, a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA. Specifically, we…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Attention Dropout · Cosine Annealing · Linear Warmup With Cosine Annealing · Layer Normalization · Residual Connection
