Linear Alignment of Vision-language Models for Image Captioning

Fabian Paischer; Markus Hofmarcher; Sepp Hochreiter; Thomas Adler

arXiv:2307.05591·cs.CV·February 11, 2025

TL;DR

This paper introduces ReCap, a fast, lightweight image captioning method that linearly re-aligns CLIP's joint embedding space. It also proposes CLIP-based metrics that correlate more strongly with human judgment across multiple datasets.

Contribution

The paper proposes a linear alignment technique for CLIP embeddings that enables rapid training, together with new CLIP-based evaluation metrics on which the resulting captioning method performs strongly.

Findings

1. ReCap can be trained up to 1000 times faster than existing lightweight methods.

2. ReCap outperforms competitors on CLIP-based metrics across multiple datasets.

3. The proposed metrics correlate more strongly with human judgment than existing metrics.

Abstract

Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches leverage CLIP for cross-modal retrieval to condition pre-trained language models on visual input. However, CLIP generally suffers from a misalignment of image and text modalities in the joint embedding space. We investigate efficient methods to linearly re-align the joint embedding space for the downstream task of image captioning. This leads to an efficient training protocol that merely requires computing a closed-form solution for a linear mapping in the joint CLIP space. Consequently, we propose a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics built on CLIP score…
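To make the "closed-form solution for a linear mapping" concrete, here is a minimal sketch of one way such a mapping could be computed. It uses ridge-regularized least squares, which is an assumption chosen for illustration; the abstract does not say which closed-form estimator the authors use. The embeddings are synthetic stand-ins rather than real CLIP outputs, and the names `fit_linear_alignment`, `l2_normalize`, `mean_cosine`, and the `ridge` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fit_linear_alignment(image_emb, text_emb, ridge=1e-3):
    """Return W minimizing ||image_emb @ W - text_emb||_F^2 + ridge * ||W||_F^2.

    The minimizer has the closed form W = (X^T X + ridge * I)^{-1} X^T Y,
    so no gradient-based training is needed. (Illustrative estimator only.)
    """
    dim = image_emb.shape[1]
    gram = image_emb.T @ image_emb + ridge * np.eye(dim)
    return np.linalg.solve(gram, image_emb.T @ text_emb)


def mean_cosine(a, b):
    """Average cosine similarity between paired rows of a and b."""
    return float(np.mean(np.sum(l2_normalize(a) * l2_normalize(b), axis=1)))


# Synthetic stand-ins for paired embeddings (assumption: not real CLIP features).
rng = np.random.default_rng(0)
n_pairs, dim = 512, 16
text_emb = l2_normalize(rng.normal(size=(n_pairs, dim)))

# Simulate a misalignment: a rotation with per-axis scaling, plus small noise.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
distortion = rotation * rng.uniform(0.5, 1.5, size=dim)
image_emb = text_emb @ distortion + 0.01 * rng.normal(size=(n_pairs, dim))

W = fit_linear_alignment(image_emb, text_emb)
aligned_emb = image_emb @ W

before = mean_cosine(image_emb, text_emb)
after = mean_cosine(aligned_emb, text_emb)
print(f"mean cosine before alignment: {before:.3f}")
print(f"mean cosine after alignment:  {after:.3f}")
```

Because the fit reduces to solving a single linear system, its cost depends only on the number of pairs and the embedding dimension. In a retrieval setting, the mapped image embedding `image_emb @ W` would be compared against caption embeddings by cosine similarity.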

Code & Models

Repositories

ml-jku/semantic-image-text-alignment (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training