Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Longtian Qiu; Shan Ning; Xuming He

arXiv:2401.02347·cs.CV·January 5, 2024·1 cites

TL;DR

This paper introduces a zero-shot image captioning method that narrows the modality gap in CLIP's latent space using subregion features, noise injection, and reranking, improving captioning performance without paired image-caption training data.

Contribution

The paper proposes a new framework for zero-shot captioning that leverages text-only training and addresses CLIP's modality gap through subregion aggregation and noise strategies.
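The noise-injection idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the zero-mean Gaussian noise, the value of `sigma`, the 512-dimensional embeddings, and the helper names `l2_normalize` and `inject_noise` are all assumptions made for illustration, and random vectors stand in for real CLIP text features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def inject_noise(text_features, sigma, rng):
    """Perturb unit-norm text features with zero-mean Gaussian noise.

    Hypothetical helper: the noise distribution and `sigma` are assumptions,
    not values taken from the paper.
    """
    text_features = l2_normalize(text_features)
    noise = rng.normal(loc=0.0, scale=sigma, size=text_features.shape)
    # Renormalize so the perturbed features stay on the unit sphere.
    return l2_normalize(text_features + noise)


rng = np.random.default_rng(0)
# Stand-in for a batch of CLIP text embeddings (4 captions, 512 dims).
text_features = rng.normal(size=(4, 512))
noisy = inject_noise(text_features, sigma=0.02, rng=rng)

# Cosine similarity between each clean feature and its noisy counterpart.
cos = np.sum(l2_normalize(text_features) * noisy, axis=-1)
print(cos)
```

During text-only training, a decoder would be conditioned on features like `noisy` rather than on the clean text features, so that it tolerates the offset between text and image embeddings at inference time.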

Findings

01. Achieves state-of-the-art zero-shot captioning performance on MSCOCO and Flickr30k.

02. Demonstrates the effectiveness of subregion features and noise injection in reducing the modality gap.

03. Extends the approach to zero-shot VQA, showing its versatility.

Abstract

Image captioning aims to generate descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Language-Image Pre-training (CLIP) offers a promising approach to zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis of the CLIP latent space, which leads to two findings. First, we observe that CLIP's visual features of image subregions can achieve closer proximity to the paired caption, due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired…
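The first finding in the abstract, that subregion features can lie closer to the paired caption than the whole-image feature, can be probed with a simple cosine-similarity comparison. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the nine subregions and 512 dimensions are arbitrary, and random vectors stand in for real CLIP image and text features, so its printed numbers say nothing about the paper's results.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of `a` (n, d) and `b` (m, d)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def compare_global_and_subregions(global_feat, subregion_feats, caption_feat):
    """Compare caption similarity of the whole image against each subregion.

    Hypothetical helper. Returns the global similarity, the per-subregion
    similarities, and the index of the subregion closest to the caption.
    """
    caption = caption_feat[None, :]
    global_sim = float(cosine_similarity(global_feat[None, :], caption)[0, 0])
    sub_sims = cosine_similarity(subregion_feats, caption)[:, 0]
    best = int(np.argmax(sub_sims))
    return global_sim, sub_sims, best


rng = np.random.default_rng(0)
# Stand-ins for CLIP features: one whole image, 9 crops, one paired caption.
global_feat = rng.normal(size=512)
subregion_feats = rng.normal(size=(9, 512))
caption_feat = rng.normal(size=512)

global_sim, sub_sims, best = compare_global_and_subregions(
    global_feat, subregion_feats, caption_feat
)
print(global_sim, sub_sims[best], best)
```

With real features, `subregion_feats` would come from encoding crops of the image with the CLIP image encoder; how the crops are chosen is left open here.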

Code & Models

Repositories

artanic30/maccap (PyTorch, official)

Videos

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training