LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation

Weiquan Huang; Aoqi Wu; Yifan Yang; Xufang Luo; Yuqing Yang; Usman Naseem; Chunyu Wang; Qi Dai; Xiyang Dai; Dongdong Chen; Chong Luo; Lili Qiu; Liang Hu

arXiv:2411.04997·cs.CV·February 26, 2026·2 cites

TL;DR

This paper introduces LLM2CLIP, a method that embeds a large language model into a pretrained CLIP to strengthen cross-modal representations, particularly for long and complex captions, at nearly the same training cost as standard CLIP fine-tuning.

Contribution

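The authors propose an efficient fine-tuning framework that first converts an LLM into an embedding-compatible form and then couples it with the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image-caption pairs, avoiding large-scale retraining.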

Findings

01 Outperforms state-of-the-art CLIP variants like EVA02 and SigLIP-2.

02 Improves zero-shot image-text retrieval for long and complex captions.

03 Enhances performance across multiple downstream tasks including classification and segmentation.

Abstract

CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the superior linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly in handling long and complex captions. We introduce an efficient fine-tuning framework that embeds an LLM into a pretrained CLIP while incurring nearly the same training cost as standard CLIP fine-tuning. Our method first converts the LLM into an embedding-compatible form for the CLIP setting, and then couples it with the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image-caption pairs. With this strategy, we achieve large performance gains without large-scale retraining, outperforming…
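The contrastive objective and the adaptor mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single linear adaptor, the feature dimensions, the temperature of 0.07, and the random arrays standing in for LLM caption embeddings and CLIP vision features are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of each input is a matched image-caption pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarity matrix
    diag = np.arange(logits.shape[0])            # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical stand-ins: random arrays in place of real encoder outputs.
rng = np.random.default_rng(0)
batch, llm_dim, clip_dim = 4, 32, 16
llm_text_features = rng.normal(size=(batch, llm_dim))   # assumed LLM caption embeddings
image_features = rng.normal(size=(batch, clip_dim))     # assumed CLIP vision features

# Assumed adaptor: one linear map from the LLM space into the CLIP space.
adaptor_weight = rng.normal(size=(llm_dim, clip_dim)) * 0.1
text_features = llm_text_features @ adaptor_weight

loss = clip_contrastive_loss(image_features, text_features)
```

In a training loop, only the adaptor weights (and whichever encoder parameters are left unfrozen) would be updated to reduce this loss; which parameters the paper actually trains is not stated in the text above.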

Code & Models

Repositories

microsoft/LLM2CLIP
PyTorch · Official

Videos

LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques

Methods: Attention Is All You Need · Contrastive Learning · Adam · Linear Layer · Absolute Position Encodings · Multi-Head Attention · Residual Connection · Softmax · Byte Pair Encoding · Dropout