LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Usman Naseem, Chunyu Wang, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu, Liang Hu

TL;DR
This paper introduces LLM2CLIP, a method that integrates large language models into CLIP to enhance cross-modal representations, significantly improving performance on various image-text tasks with minimal additional training.
Contribution
The authors propose an efficient fine-tuning framework that embeds LLMs into CLIP, boosting multimodal understanding without large-scale retraining or high computational costs.
Findings
Outperforms state-of-the-art CLIP variants like EVA02 and SigLIP-2.
Improves zero-shot image-text retrieval for long and complex captions.
Enhances performance across multiple downstream tasks including classification and segmentation.
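To make the retrieval finding concrete, the following is a minimal sketch of how zero-shot text-to-image retrieval is commonly scored with Recall@K. It is not taken from the paper: the function name `recall_at_k` and the toy features are assumptions for illustration, and the sketch assumes caption i is paired with image i.

```python
import numpy as np

def recall_at_k(image_feats, text_feats, k=1):
    """Text-to-image Recall@K, assuming caption i matches image i (illustrative helper)."""
    # Normalize so the dot product equals cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = txt @ img.T                        # shape: (n_texts, n_images)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar images
    targets = np.arange(sims.shape[0])[:, None]
    # A caption counts as a hit if its paired image appears in its top-k list.
    return float((topk == targets).any(axis=1).mean())

# Toy features: three orthogonal "images" and perfectly matching "captions".
toy_images = np.eye(3)
toy_texts = np.eye(3)
score = recall_at_k(toy_images, toy_texts, k=1)
```

In practice the features would come from the vision encoder and the text encoder; here they are hand-made so the behavior is easy to check.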
Abstract
CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the superior linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly in handling long and complex captions. We introduce an efficient fine-tuning framework that embeds an LLM into a pretrained CLIP while incurring nearly the same training cost as standard CLIP fine-tuning. Our method first converts the LLM into an embedding-compatible form for the CLIP setting, and then couples it with the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image-caption pairs. With this strategy, we achieve large performance gains without large-scale retraining, outperforming…
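The coupling step described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the LLM text embeddings and the CLIP image features are already computed, models the "lightweight adaptor" as a single hypothetical linear layer (`adaptor_forward`), and uses the standard symmetric CLIP-style contrastive loss with an assumed temperature of 0.07. All dimensions and names are made up for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def adaptor_forward(text_emb, W, b):
    """Hypothetical adaptor: one linear map from LLM space to CLIP space."""
    return text_emb @ W + b

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature        # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Random stand-ins for precomputed features (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
batch, llm_dim, clip_dim = 4, 32, 16
llm_text_emb = rng.normal(size=(batch, llm_dim))   # output of the embedding-style LLM
image_feats = rng.normal(size=(batch, clip_dim))   # output of the CLIP vision encoder
W = rng.normal(scale=0.02, size=(llm_dim, clip_dim))
b = np.zeros(clip_dim)

text_feats = adaptor_forward(llm_text_emb, W, b)
loss = clip_contrastive_loss(image_feats, text_feats)
```

In a real training loop the adaptor parameters would be updated by gradient descent on this loss; which components are frozen or trained is not specified in the text above, so the sketch leaves that open.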
Code & Models
- 🤗 microsoft/LLM2CLIP-EVA02-L-14-336 · model · 54 dl · ♡ 60
- 🤗 microsoft/LLM2CLIP-Openai-L-14-336 · model · 8.5k dl · ♡ 43
- 🤗 microsoft/LLM2CLIP-EVA02-B-16 · model · 71 dl · ♡ 10
- 🤗 microsoft/LLM2CLIP-Openai-B-16 · model · 1.8k dl · ♡ 18
- 🤗 microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned · model · 10k dl · ♡ 38
- 🤗 microsoft/LLM2CLIP-Openai-L-14-224 · model · 81 dl · ♡ 5
- 🤗 microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned · model · 1.7k dl · ♡ 9
- 🤗 microsoft/LLM2CLIP-Llama3.2-1B-EVA02-L-14-336 · model · ♡ 10
- 🤗 microsoft/LLM2CLIP-Llama3.1-8B-siglip2-so400m-patch14-224 · model · ♡ 9
- 🤗 HugC/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned · model · 2 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
Methods: Attention Is All You Need · Contrastive Learning · Adam · Linear Layer · Absolute Position Encodings · Multi-Head Attention · Residual Connection · Softmax · Byte Pair Encoding · Dropout
