TL;DR
This paper introduces an image captioning method that uses only the CLIP model and text data at training time. It injects noise during training to bridge the gap between CLIP's image and text embedding spaces, and reports state-of-the-art zero-shot results without training on captioned images.
Contribution
The paper trains a decoder that maps CLIP textual embeddings back to text, using only text data and noise injection, so no captioned images are needed for training.
Findings
Achieves state-of-the-art zero-shot image captioning on four benchmarks.
Demonstrates style transfer in captioning as part of the benchmark evaluation.
Noise injection during training compensates for the gap between CLIP's image and text embedding spaces.
Abstract
We consider the task of image-captioning using only the CLIP model and additional text data at training time, and no additional captioned images. Our approach relies on the fact that CLIP is trained to make visual and textual embeddings similar. Therefore, we only need to learn how to translate CLIP textual embeddings back into text, and we can learn how to do this by learning a decoder for the frozen CLIP text encoder using only text. We argue that this intuition is "almost correct" because of a gap between the embedding spaces, and propose to rectify this via noise injection during training. We demonstrate the effectiveness of our approach by showing SOTA zero-shot image captioning across four benchmarks, including style transfer. Code, data, and models are available on GitHub.
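The noise-injection idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes zero-mean Gaussian noise added to L2-normalized embeddings, which the abstract does not specify, and the names `inject_noise` and `decoder_input` and the value `sigma=0.1` are hypothetical. Random arrays stand in for CLIP embeddings, and the text decoder itself is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def inject_noise(text_emb, sigma, rng):
    """Perturb normalized text embeddings with N(0, sigma^2 I) noise.

    Assumption: Gaussian noise with a hand-picked sigma; the noise
    distribution and scale are illustrative choices, not source facts.
    """
    unit = l2_normalize(text_emb)
    noise = rng.normal(loc=0.0, scale=sigma, size=unit.shape)
    return unit + noise


def decoder_input(emb, training, sigma=0.1, rng=None):
    """Build the vector handed to the (omitted) text decoder.

    Training: emb is a CLIP *text* embedding and gets noise added.
    Inference: emb is a CLIP *image* embedding and is passed through
    with normalization only.
    """
    if training:
        rng = rng if rng is not None else np.random.default_rng()
        return inject_noise(emb, sigma, rng)
    return l2_normalize(emb)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for frozen CLIP outputs (hypothetical 512-dim embeddings).
    text_emb = rng.normal(size=(4, 512))
    image_emb = rng.normal(size=(1, 512))

    train_in = decoder_input(text_emb, training=True, sigma=0.1, rng=rng)
    test_in = decoder_input(image_emb, training=False)
    print(train_in.shape, test_in.shape)
```

The design point is that the decoder only ever sees text embeddings during training, so the added noise is what exposes it to nearby points in the embedding space, where image embeddings are expected to fall at inference time.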
Taxonomy
Methods: Cosine Annealing · Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing · Dropout · Adam · Discriminative Fine-Tuning · Weight Decay · Byte Pair Encoding
