Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction
Rui Fonseca, Bruno Martins, Gil Rocha

TL;DR
TOMCap is a text-only training approach for image captioning that combines retrieval augmentation with modality gap correction. It outperforms other text-only and training-free captioning methods without requiring aligned image-caption pairs for training.
Contribution
The paper presents TOMCap, a new method that enables effective image captioning using only text data by combining retrieval-augmented prompts and modality gap reduction techniques.
Findings
TOMCap outperforms other text-only and training-free captioning methods.
Retrieval-augmentation improves caption quality.
Modality gap correction enhances model performance.
Abstract
Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any human-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method prompts a pre-trained language model decoder with information derived from a CLIP representation that has first been processed to reduce the modality gap. We specifically tested the combined use of retrieved example captions and latent vector representations to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other…
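The abstract names three ingredients: a CLIP representation, a step that reduces the modality gap, and retrieved example captions that guide the decoder. The sketch below illustrates how such pieces could fit together at inference time. It is not the authors' implementation: the abstract does not say how the gap is corrected, so a simple mean-shift (subtract the mean image embedding, add the mean text embedding) stands in for it, the 2-D vectors are toy values rather than real CLIP outputs, and every function name and the prompt wording are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length, as is usual for CLIP embeddings."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def correct_modality_gap(image_emb, image_mean, text_mean):
    """Assumed stand-in for gap correction: shift the image embedding by the
    difference between the mean text and mean image embeddings."""
    shifted = [x - im + tm for x, im, tm in zip(image_emb, image_mean, text_mean)]
    return normalize(shifted)

def retrieve_captions(query_emb, caption_embs, captions, k=2):
    """Return the k captions whose embeddings have the highest dot product
    (cosine similarity for unit vectors) with the query."""
    scores = [sum(q * c for q, c in zip(query_emb, emb)) for emb in caption_embs]
    ranked = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    return [captions[i] for i in ranked[:k]]

def build_prompt(retrieved):
    """Hypothetical prompt template for the language model decoder."""
    lines = ["Similar images show:"]
    lines += [f"- {caption}" for caption in retrieved]
    lines.append("This image shows:")
    return "\n".join(lines)

# Toy text datastore (values are made up for illustration).
captions = ["a dog runs on grass", "a cat sleeps on a sofa", "a puppy plays outside"]
caption_embs = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1])]

# Toy image embedding plus assumed per-modality means.
image_emb = normalize([1.5, 0.6])
image_mean = [0.5, 0.5]
text_mean = [0.0, 0.0]

corrected = correct_modality_gap(image_emb, image_mean, text_mean)
retrieved = retrieve_captions(corrected, caption_embs, captions, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

In this sketch the corrected embedding serves both as the retrieval query and as the latent vector that could be passed to the decoder alongside the text prompt. Whether TOMCap uses one vector for both roles is not stated in the text above.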
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
