DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training
Wei Li, Linchao Zhu, Longyin Wen, Yi Yang

TL;DR
DeCap is a lightweight decoder, trained only on text, that uses CLIP embeddings for zero-shot image captioning. It reduces data and computation requirements and outperforms existing zero-shot captioning methods on MSCOCO and NoCaps.
Contribution
The paper introduces DeCap, a novel zero-shot captioning framework using a visual-aware language decoder trained only on text data, with a training-free modality gap reduction mechanism.
Findings
DeCap outperforms existing zero-shot captioning methods on MSCOCO and NoCaps.
The approach requires only text data for training, simplifying data collection.
The modality gap reduction improves the alignment between visual and textual embeddings.
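This summary does not spell out how the training-free modality gap reduction works, so the following is only a minimal sketch of one way such a step could look. It assumes a small memory of text embeddings is available and maps an image embedding to a similarity-weighted combination of them, so the decoder receives a prefix that lies in the text embedding space it was trained on. The function names, the `temperature` value, and the toy 3-dimensional vectors (standing in for real CLIP embeddings) are all illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def project_to_text_space(image_emb, text_memory, temperature=0.01):
    """Hypothetical helper: express an image embedding as a softmax-weighted
    combination of stored text embeddings (no training involved)."""
    image_emb = l2_normalize(image_emb)
    memory = [l2_normalize(m) for m in text_memory]

    # Cosine similarity between the image embedding and each memory entry.
    sims = [sum(a * b for a, b in zip(image_emb, m)) for m in memory]

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    max_sim = max(sims)
    exps = [math.exp((s - max_sim) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of text embeddings, re-normalized to unit length.
    dim = len(image_emb)
    combined = [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
    return l2_normalize(combined)


# Toy usage: two "text" embeddings and one "image" embedding (assumed values).
text_memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_emb = [0.9, 0.1, 0.4]
projected = project_to_text_space(image_emb, text_memory)
```

In this toy case the projected vector lies entirely in the span of the two stored text embeddings and is dominated by the first one, which is the entry most similar to the image embedding. A lower `temperature` makes the result closer to the single nearest text embedding; a higher one blends more entries.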
Abstract
Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest. Prior work approaches zero-shot captioning by either utilizing existing large language models (e.g., GPT-2) or pre-training an encoder-decoder network in an end-to-end manner. In this work, we propose a simple framework, named DeCap, for zero-shot captioning. We introduce a lightweight visual-aware language decoder. This decoder is both data-efficient and computation-efficient: 1) it requires only text data for training, easing the burden of collecting paired data; 2) it does not require end-to-end training. When trained with text-only data, the decoder takes the text embedding extracted from the off-the-shelf CLIP encoder as a prefix…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
