Text Data-Centric Image Captioning with Interactive Prompts

Yiyu Wang; Hao Luo; Jungang Xu; Yingfei Sun; Fan Wang

arXiv:2403.19193 · cs.CV · March 29, 2024 · 1 citation

Open Access

TL;DR

This paper introduces TIPCap, a novel image captioning approach that reduces reliance on paired data by using a Gaussian-driven mapping and interactive prompts, achieving state-of-the-art results on MS-COCO and Flickr30K.

Contribution

The paper presents a unified, data-centric image captioning method with interactive prompts that adapts across various data configurations and improves captioning accuracy.

Findings

01. Outperforms existing weakly supervised and unsupervised methods.

02. Achieves state-of-the-art results on MS-COCO and Flickr30K.

03. Effectively reduces dependence on paired data.

Abstract

Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative language models (e.g., GPT-2) have shown strong performance in various tasks, and they also provide new solutions for image captioning with web paired data, unpaired data, or even text-only data. Among them, the mainstream solution is to project image embeddings into the text embedding space with the assistance of consistent representations between image-text pairs from the CLIP model. However, current methods still face several challenges: adapting to the diversity of data configurations in a unified solution, accurately estimating the image-text embedding bias, and correcting unsatisfactory prediction results in the inference stage. This paper proposes…
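The image-text embedding bias mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a small set of paired embeddings is available, models the gap between image and text embeddings with an independent Gaussian per dimension (a simplification of a full multivariate Gaussian), and uses hypothetical function names `estimate_gap` and `simulate_image_embedding`. The toy vectors stand in for CLIP embeddings.

```python
import random
import statistics


def estimate_gap(image_embs, text_embs):
    """Per-dimension mean and std of (image - text) over paired embeddings.

    Assumption: a diagonal Gaussian is enough to describe the gap.
    """
    dims = len(image_embs[0])
    diffs = [
        [img[d] - txt[d] for d in range(dims)]
        for img, txt in zip(image_embs, text_embs)
    ]
    mean = [statistics.fmean(row[d] for row in diffs) for d in range(dims)]
    std = [statistics.pstdev([row[d] for row in diffs]) for d in range(dims)]
    return mean, std


def simulate_image_embedding(text_emb, mean, std, rng):
    """Shift a text embedding by one sample from the estimated gap.

    During text-only training, the shifted vector stands in for the
    image embedding that the caption decoder will see at inference.
    """
    return [t + rng.gauss(m, s) for t, m, s in zip(text_emb, mean, std)]


# Toy paired data (two pairs, two dimensions), purely illustrative.
image_embs = [[1.0, 2.0], [3.0, 2.0]]
text_embs = [[0.0, 1.0], [1.0, 1.0]]

gap_mean, gap_std = estimate_gap(image_embs, text_embs)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pseudo_image = simulate_image_embedding([0.5, 0.5], gap_mean, gap_std, rng)
```

With real data, the per-dimension statistics would be replaced by whatever estimate of the gap the chosen method prescribes. The sketch only shows the general pattern of training on text embeddings that have been perturbed to resemble image embeddings.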

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training