Linear Alignment of Vision-language Models for Image Captioning
Fabian Paischer, Markus Hofmarcher, Sepp Hochreiter, Thomas Adler

TL;DR
This paper introduces ReCap, a fast and efficient image captioning method that aligns CLIP's joint embedding space linearly, improving performance and metric correlation with human judgment across multiple datasets.
Contribution
The paper proposes a novel linear alignment technique for CLIP embeddings that enables rapid training and superior captioning performance as measured by newly proposed CLIP-based evaluation metrics.
Findings
ReCap can be trained up to 1000 times faster than existing lightweight captioning methods.
ReCap outperforms competitors on CLIP-based metrics across multiple datasets.
Proposed metrics correlate more strongly with human judgment than existing ones.
Abstract
Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches leverage CLIP for cross-modal retrieval to condition pre-trained language models on visual input. However, CLIP generally suffers from a mis-alignment of image and text modalities in the joint embedding space. We investigate efficient methods to linearly re-align the joint embedding space for the downstream task of image captioning. This leads to an efficient training protocol that merely requires computing a closed-form solution for a linear mapping in the joint CLIP space. Consequently, we propose a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics built on CLIP score…
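The closed-form linear mapping mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact objective, so this example assumes ridge-regularized least squares that maps image embeddings onto their paired text embeddings; the function names (fit_linear_alignment, retrieve_top_k), the ridge value, and the synthetic embeddings are all hypothetical stand-ins, not the authors' code or real CLIP outputs.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def fit_linear_alignment(image_emb, text_emb, ridge=1e-3):
    """Closed-form ridge solution W = (X^T X + ridge * I)^{-1} X^T Y.

    Assumption: ridge-regularized least squares; the source only says
    that a closed-form linear mapping is computed.
    """
    X = np.asarray(image_emb, dtype=np.float64)
    Y = np.asarray(text_emb, dtype=np.float64)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def retrieve_top_k(image_vec, W, caption_emb, k=3):
    """Indices of the k captions most cosine-similar to the mapped image."""
    mapped = l2_normalize(image_vec @ W)
    sims = l2_normalize(caption_emb) @ mapped
    return np.argsort(-sims)[:k]


# Synthetic stand-in for paired embeddings: the "image" side is a rotated,
# slightly noisy copy of the "text" side, mimicking a modality misalignment.
rng = np.random.default_rng(0)
n_pairs, dim = 200, 16
text_emb = l2_normalize(rng.normal(size=(n_pairs, dim)))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
image_emb = text_emb @ rotation + 0.01 * rng.normal(size=(n_pairs, dim))

# "Training" is a single linear solve; no gradient descent is involved.
W = fit_linear_alignment(image_emb, text_emb)

# Retrieve candidate captions for the first image in the re-aligned space.
top = retrieve_top_k(image_emb[0], W, text_emb, k=3)
print(top)
```

Because fitting reduces to one linear solve over the paired embeddings, its cost is small compared with gradient-based training, which is consistent with the training-speed claim above. In a retrieval-conditioned captioner, the retrieved captions would then be passed to a pre-trained language model as context; that step is omitted here.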
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
