Text Data-Centric Image Captioning with Interactive Prompts
Yiyu Wang, Hao Luo, Jungang Xu, Yingfei Sun, Fan Wang

TL;DR
This paper introduces TIPCap, a novel image captioning approach that reduces reliance on paired data by using a Gaussian-driven mapping and interactive prompts, achieving state-of-the-art results on MS-COCO and Flickr30K.
Contribution
The paper presents a unified, data-centric image captioning method with interactive prompts that adapts across various data configurations and improves captioning accuracy.
Findings
Outperforms existing weakly supervised and unsupervised methods.
Achieves state-of-the-art results on MS-COCO and Flickr30K.
Effectively reduces dependence on paired data.
Abstract
Supervised image captioning approaches have made great progress, but collecting high-quality human-annotated image-text data remains challenging. Recently, large-scale vision-language models (e.g., CLIP) and large-scale generative language models (e.g., GPT-2) have shown strong performance on a variety of tasks, opening up new solutions for image captioning with web paired data, unpaired data, or even text-only data. Among these, the mainstream solution is to project image embeddings into the text embedding space, relying on the consistent representations that CLIP learns for image-text pairs. However, current methods still face several challenges: adapting to diverse data configurations within a unified solution, accurately estimating the image-text embedding bias, and correcting unsatisfactory predictions at inference time. This paper proposes…
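To make the idea of an "image-text embedding bias" concrete, here is a minimal sketch, not the paper's implementation. It assumes the bias can be modeled as a diagonal Gaussian fitted to the per-dimension difference between paired image and text embeddings, and that a text embedding can then be shifted by a sampled bias to stand in for an image embedding. The function names (estimate_gap, text_to_pseudo_image) and the synthetic data are hypothetical; real embeddings would come from a model such as CLIP.

```python
import random
import statistics

def estimate_gap(image_embs, text_embs):
    """Fit a diagonal Gaussian to the per-dimension difference (image - text)."""
    diffs = [[i - t for i, t in zip(img, txt)]
             for img, txt in zip(image_embs, text_embs)]
    dims = list(zip(*diffs))  # transpose: one tuple of differences per dimension
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) for d in dims]
    return mean, std

def text_to_pseudo_image(text_emb, mean, std, rng):
    """Shift a text embedding by a bias sampled from the fitted Gaussian."""
    return [t + rng.gauss(m, s) for t, m, s in zip(text_emb, mean, std)]

# Synthetic stand-ins for paired embeddings (assumption: no real model is used here).
rng = random.Random(0)
dim, n_pairs = 4, 200
true_shift = [0.5, -0.2, 0.1, 0.0]
text_embs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_pairs)]
image_embs = [[t + s + rng.gauss(0, 0.05) for t, s in zip(txt, true_shift)]
              for txt in text_embs]

mean, std = estimate_gap(image_embs, text_embs)
pseudo = text_to_pseudo_image(text_embs[0], mean, std, rng)
```

With only a small set of paired samples, the fitted mean and standard deviation approximate the offset between the two modalities, so text-only data can be mapped toward the image side during training. How the paper itself estimates and applies the bias is not stated in the truncated abstract above.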
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
