DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

Wei Li; Linchao Zhu; Longyin Wen; Yi Yang

arXiv:2303.03032 · cs.CV · March 7, 2023 · 24 citations

TL;DR

DeCap is a lightweight decoder, trained on text alone, that turns CLIP embeddings into captions for zero-shot image captioning. It needs neither paired image-text data nor end-to-end training, and it outperforms existing zero-shot captioning methods on MSCOCO and NoCaps.

Contribution

The paper introduces DeCap, a novel zero-shot captioning framework using a visual-aware language decoder trained only on text data, with a training-free modality gap reduction mechanism.

Findings

1. DeCap outperforms existing zero-shot captioning methods on MSCOCO and NoCaps.

2. The approach requires only text data for training, simplifying data collection.

3. The modality gap reduction improves the alignment between visual and textual embeddings.
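
This page does not spell out how the training-free modality gap reduction works. One way such a step can be realized is sketched below as an assumption, not as the paper's stated procedure: re-express an image embedding as a similarity-weighted average of stored text embeddings, so that the decoder only ever sees vectors from the text side. The names `project_to_text_space`, `memory`, and `temperature` are hypothetical, and the toy 2-D vectors stand in for real CLIP embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def project_to_text_space(image_emb, memory, temperature=0.01):
    """Map an image embedding onto a weighted mix of text embeddings.

    Assumption: `memory` is a list of text embeddings collected beforehand;
    no parameters are learned here, so the step is training-free.
    """
    image_emb = normalize(image_emb)
    memory = [normalize(m) for m in memory]
    # Cosine similarity between the image and every stored text embedding.
    sims = [sum(a * b for a, b in zip(image_emb, m)) for m in memory]
    # Numerically stable softmax over similarities.
    max_sim = max(sims)
    weights = [math.exp((s - max_sim) / temperature) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted average of text embeddings, renormalized to unit length.
    dim = len(image_emb)
    mixed = [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]
    return normalize(mixed)


memory = [[1.0, 0.0], [0.0, 1.0]]   # toy "text" embeddings
projected = project_to_text_space([0.9, 0.1], memory)
```

With a small temperature the result sits almost on top of the nearest stored text embedding; a larger temperature blends more entries.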

Abstract

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest. Prior work approaches zero-shot captioning either by utilizing existing large language models (e.g., GPT-2) or by pre-training an encoder-decoder network in an end-to-end manner. In this work, we propose a simple framework, named DeCap, for zero-shot captioning. We introduce a lightweight visual-aware language decoder. This decoder is both data-efficient and computation-efficient: 1) it requires only text data for training, easing the burden of collecting paired data; 2) it does not require end-to-end training. When trained with text-only data, the decoder takes the text embedding extracted from the off-the-shelf CLIP encoder as a prefix…
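
The text-only training setup described in the abstract can be illustrated with a minimal sketch of how one training example might be assembled. Several parts are assumptions made for illustration: `encode_text` is a hypothetical hash-based stand-in for the frozen CLIP text encoder (so the sketch runs without any model), the whitespace tokenizer and the `<bos>`/`<eos>` markers are placeholders, and the training target is assumed to be the caption itself.

```python
import hashlib
import math

EMBED_DIM = 8  # toy size; real CLIP embeddings are much larger


def encode_text(caption):
    """Hypothetical stand-in for a frozen CLIP text encoder.

    Returns a deterministic unit-length vector derived from a hash.
    """
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    raw = [b / 255.0 - 0.5 for b in digest[:EMBED_DIM]]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def build_training_example(caption):
    """Turn one caption into a prefix plus teacher-forcing inputs/targets.

    No image is involved: the prefix is the embedding of the caption itself,
    and the decoder would be trained to reconstruct the caption from it.
    """
    tokens = ["<bos>"] + caption.lower().split() + ["<eos>"]
    return {
        "prefix": encode_text(caption),  # conditions the decoder
        "inputs": tokens[:-1],           # what the decoder reads
        "targets": tokens[1:],           # next token it should predict
    }


example = build_training_example("A dog runs on the beach")
```

Because the prefix comes from the text encoder alone, a corpus of captions is enough to build every training example in this sketch; no paired images are needed.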

Code & Models

Repositories

dhg-wei/decap (PyTorch, official)

Videos

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training