Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering

Rabiul Awal; Le Zhang; Aishwarya Agrawal

arXiv:2306.09996 · cs.CV · February 11, 2025 · 5 cites

TL;DR

This paper investigates various prompting strategies, including question templates, image caption augmentation, and chain-of-thought reasoning, to improve zero- and few-shot visual question answering performance in vision-language models.

Contribution

The paper systematically evaluates the impact of templates, captions, and reasoning methods, and proposes a simple LLM-guided pre-processing technique to better handle open-ended VQA responses.

Findings

1. Template choice significantly affects VQA outcomes.
2. Caption augmentation improves model performance despite direct image access.
3. Self-consistency reasoning can recover performance drops from standard CoT.
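
To make the third finding concrete, here is a minimal Python sketch of the voting step in self-consistency: several reasoning paths are sampled, and the most frequent final answer is kept. It is an illustration under stated assumptions, not the paper's implementation. The function names `normalize_answer` and `self_consistent_answer` are hypothetical, and the sampled answers are assumed to have already been extracted from the model's outputs.

```python
from collections import Counter
from typing import Sequence


def normalize_answer(answer: str) -> str:
    """Lowercase and trim an answer so near-identical strings compare equal.

    Hypothetical helper: the exact normalization used in the paper is not
    described here, so this is only one plausible choice.
    """
    return answer.strip().lower().rstrip(".!?")


def self_consistent_answer(sampled_answers: Sequence[str]) -> str:
    """Return the most frequent normalized answer among the sampled paths."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize_answer(a) for a in sampled_answers)
    # most_common(1) yields a single (answer, count) pair for the top answer.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Illustrative answers only, not data from the paper.
    print(self_consistent_answer(["Two.", "two", "three"]))
```

Voting over final answers rather than over full reasoning chains keeps the aggregation independent of how each chain is worded.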

Abstract

In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contemporary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs "see" the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it.…
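
As a rough illustration of the template and caption ideas in the abstract, the Python sketch below builds the text part of a VQA prompt from a question template and an optional image caption. The template strings, the `TEMPLATES` dictionary, the `build_prompt` function, and the "Context:" prefix are all assumptions made for this example; they are not taken from the paper or its repository.

```python
from typing import Optional

# Hypothetical question templates; the paper's actual templates may differ.
TEMPLATES = {
    "plain": "Question: {question} Answer:",
    "instruct": (
        "Answer the following question about the image in a few words. "
        "Question: {question} Answer:"
    ),
}


def build_prompt(
    question: str,
    template: str = "plain",
    caption: Optional[str] = None,
) -> str:
    """Fill a template with the question, optionally prefixed by a caption.

    The caption is extra textual context; the model is still assumed to
    receive the image itself through its usual visual input.
    """
    body = TEMPLATES[template].format(question=question.strip())
    if caption:
        return f"Context: {caption.strip()} {body}"
    return body


if __name__ == "__main__":
    print(build_prompt("What color is the bus?"))
    print(build_prompt("What color is the bus?", caption="A red bus."))
```

Keeping templates in a dictionary makes it easy to run the same question through several templates and compare accuracy, which is the kind of comparison the abstract describes.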

Code & Models

Repositories

rabiulcste/vqazero
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning