ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

Deyao Zhu; Jun Chen; Kilichbek Haydarov; Xiaoqian Shen; Wenxuan Zhang; Mohamed Elhoseiny

arXiv:2303.06594·cs.CV·March 14, 2023·40 cites

Open Access · 1 Repo

TL;DR

This paper introduces ChatCaptioner, an automatic questioning system that uses ChatGPT to ask questions about images, leading to more detailed and informative image descriptions by leveraging BLIP-2's visual understanding.

Contribution

It presents a novel method combining ChatGPT and BLIP-2 for automatic image captioning through question-driven information gathering.

Findings

01. ChatCaptioner produces significantly more informative captions.

02. It identifies 53% more objects than BLIP-2 alone.

03. Human evaluators prefer ChatCaptioner's captions over baselines.

Abstract

Asking insightful questions is crucial for acquiring knowledge and expanding our understanding of the world. However, the importance of questioning has been largely overlooked in AI research, where models have been primarily developed to answer questions. With the recent advancements of large language models (LLMs) like ChatGPT, we discover their capability to ask high-quality questions when provided with a suitable prompt. This discovery presents a new opportunity to develop an automatic questioning system. In this paper, we introduce ChatCaptioner, a novel automatic-questioning method deployed in image captioning. Here, ChatGPT is prompted to ask BLIP-2, a strong vision question-answering model, a series of informative questions about images. By continually acquiring new visual information from BLIP-2's answers, ChatCaptioner is able to generate more enriched image descriptions. We…
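The question-and-answer loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `chat_caption`, its parameters, the default number of rounds, and the final summarization step are all assumptions made for illustration. The questioner (ChatGPT in the paper) and the answerer (BLIP-2 in the paper) are passed in as plain callables, and stubs stand in for both models so the example runs without any model or API access.

```python
from typing import Callable, List, Tuple

Dialogue = List[Tuple[str, str]]  # list of (question, answer) pairs


def chat_caption(
    ask: Callable[[Dialogue], str],        # hypothetical: wraps the questioner (an LLM)
    answer: Callable[[str], str],          # hypothetical: wraps the visual answerer
    summarize: Callable[[Dialogue], str],  # hypothetical: turns the dialogue into a caption
    num_rounds: int = 5,                   # assumed default, not taken from the paper
) -> Tuple[str, Dialogue]:
    """Run a fixed number of question/answer rounds, then build a description."""
    dialogue: Dialogue = []
    for _ in range(num_rounds):
        # The questioner sees the dialogue so far, so it can ask for new information.
        question = ask(dialogue)
        # The answerer looks at the image (hidden inside the callable) and replies.
        reply = answer(question)
        dialogue.append((question, reply))
    return summarize(dialogue), dialogue


# Stub models so the sketch is runnable on its own.
def stub_ask(history: Dialogue) -> str:
    questions = ["What is in the image?", "What color is it?", "Where is it?"]
    return questions[len(history) % len(questions)]


def stub_answer(question: str) -> str:
    canned = {
        "What is in the image?": "a dog",
        "What color is it?": "brown",
        "Where is it?": "on a beach",
    }
    return canned.get(question, "not sure")


def stub_summarize(history: Dialogue) -> str:
    # Simply joins the answers; a real system would ask an LLM to write the caption.
    return " ".join(reply for _, reply in history)


caption, dialogue = chat_caption(stub_ask, stub_answer, stub_summarize, num_rounds=3)
print(caption)
```

Passing the models in as callables keeps the loop independent of any particular API; in practice `ask` and `summarize` would wrap calls to a language model and `answer` would wrap a visual question-answering model bound to a specific image.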


Code & Models

Repositories

vision-cair/chatcaptioner (PyTorch, Official)


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling