Language-Image Alignment with Fixed Text Encoders
Jingfeng Yang, Ziyang Wu, Yue Zhao, Yi Ma

TL;DR
This paper introduces LIFT, a method that trains only the image encoder against a fixed, pre-trained large language model (LLM) used as the text encoder, with no joint training of the two. It outperforms contrastive methods such as CLIP in most scenarios involving compositional understanding and long captions, while being more computationally efficient.
Contribution
The paper demonstrates that fixed LLM-based text encoders can effectively guide visual representation learning, challenging the need for joint training of text and image encoders.
Findings
LIFT outperforms CLIP in most scenarios that involve compositional understanding and long captions.
LIFT achieves considerable gains in computational efficiency by training only the image encoder.
Fixed LLM-based text embeddings are sufficient for effective visual alignment.
Abstract
Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly joint training is necessary. In particular, we investigate if a pre-trained fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much simplified framework LIFT is highly effective and it outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step…
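The core idea in the abstract (text embeddings come from a fixed encoder, and only the image encoder is trained) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state the loss, so a one-directional CLIP-style softmax contrastive loss is used, and the function names, the temperature value, and the toy two-dimensional embeddings are hypothetical. The point it shows is that the text embeddings are plain constants that could be computed once and cached, so the loss depends only on what the image encoder produces.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def image_to_text_contrastive_loss(image_embs, fixed_text_embs, temperature=0.07):
    """Mean cross-entropy over a batch in which image i should match caption i.

    fixed_text_embs are treated as constants (assumed to be precomputed by a
    fixed text encoder); only image_embs would carry gradients in training.
    The loss form and the temperature are assumptions, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in fixed_text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Similarity of image i to every caption in the batch.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(imgs)


# Hypothetical toy data: two cached caption embeddings and two image batches.
cached_text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_image_embs = [[0.9, 0.1], [0.2, 0.8]]   # image i points toward caption i
swapped_image_embs = [[0.2, 0.8], [0.9, 0.1]]   # image i points toward the wrong caption

loss_aligned = image_to_text_contrastive_loss(aligned_image_embs, cached_text_embs)
loss_swapped = image_to_text_contrastive_loss(swapped_image_embs, cached_text_embs)
```

In this toy setup the aligned batch yields a lower loss than the swapped one, which is the signal an image encoder would follow during training. Because the caption embeddings never change, they need to be computed only once per caption under this sketch, which is one plausible source of the efficiency gain the abstract mentions.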
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
