An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang

TL;DR
This paper introduces PICa, a novel method that leverages GPT-3 as an implicit knowledge base for few-shot knowledge-based visual question answering, surpassing previous supervised methods on OK-VQA.
Contribution
The paper demonstrates how GPT-3 can be effectively used as an implicit, unstructured knowledge source for multimodal VQA, eliminating the need for structured knowledge bases.
Findings
PICa outperforms the supervised state of the art on OK-VQA by an absolute +8.6 points using only 16 examples.
PICa achieves competitive few-shot results on the VQAv2 dataset.
Careful text formatting and example selection improve GPT-3's VQA performance.
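As an illustration of the example-selection finding, here is a minimal sketch of one way to choose few-shot examples: rank a pool of annotated examples by how similar their questions are to the test question and keep the top k. The token-overlap (Jaccard) similarity, the function names, and the `(caption, question, answer)` tuple layout are assumptions made for this sketch; they are not taken from the paper, whose actual selection procedure is not described on this page.

```python
from typing import List, Set, Tuple

# Hypothetical example layout: (caption, question, answer).
Example = Tuple[str, str, str]


def tokens(text: str) -> Set[str]:
    """Lowercase the text, drop question marks, and split on whitespace."""
    return set(text.lower().replace("?", " ").split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for any similarity measure."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_examples(pool: List[Example], question: str, k: int) -> List[Example]:
    """Return the k pool examples whose questions are most similar to `question`."""
    ranked = sorted(pool, key=lambda ex: jaccard(ex[1], question), reverse=True)
    return ranked[:k]


pool = [
    ("A red bus on a street.", "What color is the bus?", "red"),
    ("Two people on a court with rackets.", "What sport is this?", "tennis"),
    ("A beach at sunset.", "Where was this photo taken?", "beach"),
]
chosen = select_examples(pool, "What sport is being played?", k=1)
```

With this toy pool, the example about the sport question ranks first because it shares the most tokens with the test question.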
Abstract
Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3's power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as…
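The idea of prompting a language model with image captions can be sketched as follows. This is a minimal illustration, assuming that each image has already been converted to a caption by some captioning model and that the few-shot examples are supplied as `(caption, question, answer)` triples. The instruction line, the `Context:` / `Q:` / `A:` layout, and the name `build_prompt` are assumptions made for this sketch, not a confirmed reproduction of the authors' prompt. The call to the language model itself is omitted.

```python
from typing import List, Tuple

# Hypothetical example layout: (caption, question, answer).
Example = Tuple[str, str, str]


def build_prompt(examples: List[Example], caption: str, question: str) -> str:
    """Assemble a few-shot text prompt from captioned examples and a test query."""
    # Assumed instruction line; the exact wording is illustrative.
    header = "Please answer the question according to the context."
    # One block per in-context example, each with its answer filled in.
    blocks = [f"Context: {c}\nQ: {q}\nA: {a}" for c, q, a in examples]
    # The test query ends with an open "A:" for the model to complete.
    blocks.append(f"Context: {caption}\nQ: {question}\nA:")
    return header + "\n\n" + "\n\n".join(blocks)


prompt = build_prompt(
    examples=[("A man rides a horse on a beach.", "What animal is this?", "horse")],
    caption="A red double-decker bus on a city street.",
    question="Which city is this type of bus associated with?",
)
```

The resulting string would be sent to a text-only language model, and the text it generates after the final `A:` would be taken as the predicted answer.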
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Softmax · Cosine Annealing · Attention Dropout · Dense Connections · Byte Pair Encoding · Linear Warmup With Cosine Annealing · Dropout
