Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering
Rabiul Awal, Le Zhang, Aishwarya Agrawal

TL;DR
This paper investigates various prompting strategies, including question templates, image caption augmentation, and chain-of-thought reasoning, to improve zero- and few-shot visual question answering performance in vision-language models.
Contribution
The paper systematically evaluates the impact of question templates, image captions, and chain-of-thought reasoning methods, and proposes a simple LLM-guided pre-processing technique to better handle open-ended VQA responses.
Findings
Template choice significantly affects VQA outcomes.
Caption augmentation improves model performance despite direct image access.
Self-consistency reasoning can recover performance drops from standard CoT.
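As a rough illustration of the self-consistency idea in the last finding, the sketch below takes several answers sampled from independent chain-of-thought runs and returns the most frequent one. It is a minimal sketch, not the paper's implementation: the function names `normalize` and `self_consistent_answer`, the normalization rules, and the sample answers are all assumptions made for this example.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and trim an answer so trivially different strings match.

    Hypothetical normalization; the paper's own answer processing may differ.
    """
    return answer.strip().lower().rstrip(".")


def self_consistent_answer(sampled_answers):
    """Return the most frequent normalized answer among the sampled ones."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in sampled_answers)
    # most_common(1) yields the top (answer, count) pair; equal counts keep
    # first-seen order, so ties go to the answer that appeared first.
    return counts.most_common(1)[0][0]


# Made-up final answers from five sampled reasoning paths.
samples = ["Two", "two.", "three", " two ", "Three"]
result = self_consistent_answer(samples)
print(result)
```

Voting over final answers only, rather than over the reasoning text, means a single faulty reasoning path is outvoted as long as most paths agree.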
Abstract
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contemporary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs "see" the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it.…
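To make the template and caption ideas from the abstract concrete, the following sketch shows one way a VQA text prompt could be assembled from a question template with an optional image caption prepended. The function `build_vqa_prompt`, the default template string, and the "Image caption:" prefix are hypothetical choices for illustration; they are not taken from the paper.

```python
def build_vqa_prompt(question, caption=None,
                     template="Question: {question} Short answer:"):
    """Fill a question template, optionally prepending an image caption.

    The template and the caption prefix are illustrative assumptions only.
    """
    prompt = template.format(question=question.strip())
    if caption:
        # The caption adds a textual description of the image on top of the
        # visual features the model already receives.
        prompt = f"Image caption: {caption.strip()} {prompt}"
    return prompt


plain = build_vqa_prompt("What color is the bus?")
augmented = build_vqa_prompt(
    "What color is the bus?",
    caption="A red double-decker bus on a city street.",
)
print(plain)
print(augmented)
```

Keeping the template as a parameter makes it straightforward to compare several phrasings on the same questions, which is the kind of comparison the abstract describes.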
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
