Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Longtian Qiu, Shan Ning, Xuming He

TL;DR
This paper introduces a zero-shot image captioning method trained on text only. It reduces the modality gap in CLIP's latent space using subregion features, noise injection, and reranking, and improves zero-shot captioning performance without requiring paired image-caption annotations.
Contribution
The paper proposes a new framework for zero-shot captioning that leverages text-only training and addresses CLIP's modality gap through subregion aggregation and noise strategies.
Findings
Achieves state-of-the-art zero-shot captioning performance on MSCOCO and Flickr30k.
Demonstrates the effectiveness of subregion features and noise injection in reducing the modality gap.
Extends the approach to zero-shot VQA, showing its versatility.
Abstract
Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Language-Image Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis of the CLIP latent space, which leads to two findings. Firstly, we observe that CLIP's visual features of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired…
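To make the two ideas named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the modality gap is measured as the distance between the centroids of L2-normalized image and text embeddings, and that noise injection means adding isotropic Gaussian noise to a text embedding before renormalizing it. The function names, the noise scale `sigma`, and the synthetic embeddings are all hypothetical; real CLIP features would replace the random vectors.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_feats, text_feats):
    """Assumed gap measure: distance between the two modality centroids."""
    img_c = centroid([l2_normalize(v) for v in image_feats])
    txt_c = centroid([l2_normalize(v) for v in text_feats])
    return math.dist(img_c, txt_c)


def inject_noise(text_feat, sigma, rng):
    """Add Gaussian noise to a unit-length text embedding, then renormalize."""
    unit = l2_normalize(text_feat)
    noisy = [v + rng.gauss(0.0, sigma) for v in unit]
    return l2_normalize(noisy)


# Synthetic stand-ins for CLIP embeddings (assumption: 8-dim toy vectors).
rng = random.Random(0)
dim = 8
text_feats = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
              for _ in range(16)]
# Shift every "image" embedding along one axis to mimic a constant offset.
image_feats = [l2_normalize([v + (0.5 if i == 0 else 0.0)
                             for i, v in enumerate(t)])
               for t in text_feats]

gap = modality_gap(image_feats, text_feats)
noisy_text = [inject_noise(t, sigma=0.1, rng=rng) for t in text_feats]
```

In a text-only training setup of this kind, a decoder would be trained to reconstruct each caption from its noised text embedding, so that at inference time an image embedding falling near, but not exactly on, the text embedding is still decoded sensibly. How the paper sets the noise and combines it with subregion features is not specified in the excerpt above.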
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
