PromptCap: Prompt-Guided Task-Aware Image Captioning

Yushi Hu; Hang Hua; Zhengyuan Yang; Weijia Shi; Noah A Smith; Jiebo Luo

arXiv:2211.09699 · cs.CV · August 21, 2023 · 29 cites

TL;DR

PromptCap is a prompt-guided image captioning model that generates task-aware captions to improve knowledge-based visual question answering. It outperforms generic captioning methods and achieves state-of-the-art accuracy on OK-VQA and A-OKVQA.

Contribution

The paper introduces a prompt-guided captioning approach, trained on data synthesized with GPT-3, that produces task-aware image descriptions for better VQA performance.

Findings

01. Significantly outperforms generic captions on VQA tasks

02. Achieves state-of-the-art accuracy on the OK-VQA and A-OKVQA datasets

03. Generalizes well to unseen domains in zero-shot settings

Abstract

Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable an LM to understand images, prior work uses a captioning model to convert images into text. However, when summarizing an image in a single caption sentence, which visual entities to describe is often underspecified. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Unlike generic captioning models, PromptCap takes a natural-language prompt to control the visual…
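
The caption-then-answer pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the instruction wording, the `Context/Question/Answer` layout, and the function names `build_caption_prompt`, `build_lm_prompt`, and `answer_question` are all assumptions, and the captioner and language model are passed in as plain callables rather than tied to any real API.

```python
from typing import Any, Callable

# Hypothetical instruction wording; this page does not give the actual template.
CAPTION_INSTRUCTION = (
    "Please describe this image according to the given question: {question}"
)


def build_caption_prompt(question: str) -> str:
    """Turn the VQA question into a prompt that steers the captioner."""
    return CAPTION_INSTRUCTION.format(question=question.strip())


def build_lm_prompt(caption: str, question: str) -> str:
    """Format the caption and question as text for a black-box LM (assumed layout)."""
    return f"Context: {caption.strip()}\nQuestion: {question.strip()}\nAnswer:"


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],
    language_model: Callable[[str], str],
) -> str:
    """Caption the image under the question's guidance, then ask the LM."""
    # Step 1: the prompt tells the captioner which visual details matter.
    caption = captioner(image, build_caption_prompt(question))
    # Step 2: the LM sees only text (caption + question), never the image.
    return language_model(build_lm_prompt(caption, question)).strip()
```

Passing the captioner and the LM as callables keeps the sketch independent of any particular checkpoint or service; in practice each would wrap a real model call.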

Code & Models

Repositories

Yushi-Hu/PromptCap
PyTorch · Official

Models

🤗 tifa-benchmark/promptcap-coco-vqa
model · 42 dl · ♡ 13

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Cosine Annealing · Linear Layer · Adam · Softmax