ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Xiaoxing Hu, Kaicheng Yang, Ziyang Gong, Qi Ming, Zonghao Guo, Yu Tian, Xiang An, Ziyong Feng, Xue Yang

TL;DR
ProCLIP introduces a progressive curriculum learning framework to align CLIP's image encoder with an LLM-based text embedder, enhancing long-text processing, multilingual understanding, and semantic comprehension in vision-language tasks.
Contribution
The paper proposes a curriculum learning-based method that progressively aligns CLIP's image encoder with an LLM-based embedder, addressing the limitations of earlier approaches in handling long and multilingual texts.
Findings
Improved alignment between CLIP image encoder and LLM embedder.
Enhanced performance on long and multilingual text tasks.
Effective knowledge transfer from CLIP's text encoder to LLM embedder.
Abstract
The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during…
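For readers unfamiliar with the "direct alignment using contrastive learning" the abstract refers to, the following is a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings. It is not ProCLIP's progressive method, which this truncated abstract does not describe. The function names and the temperature value of 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss (hypothetical helper, not the paper's code).

    image_embs[k] and text_embs[k] are assumed to form a matched pair;
    every other combination in the batch is treated as a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two matched pairs versus the same pairs with texts swapped.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the matched pairs yield a loss near zero and the swapped pairs a large one. An LLM embedder trained independently of CLIP starts closer to the second situation, which is why optimizing this loss directly can push large gradients into the image encoder.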
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
