FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training
Anjia Cao, Xing Wei, Zhiheng Ma

TL;DR
FLAME leverages frozen large language models for data-efficient, multilingual language-image pre-training, effectively processing long texts and improving downstream performance over previous methods.
Contribution
The paper introduces FLAME, an approach that uses frozen large language models as text encoders, combined with multifaceted prompt distillation and facet-decoupled attention, to make language-image pre-training more data-efficient.
Findings
FLAME surpasses the previous state of the art by 4.9% in ImageNet top-1 accuracy.
Achieves a 44.4% improvement in multilingual image-to-text recall.
Outperforms previous models in long-context retrieval tasks.
Abstract
Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted…
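As a rough illustration of the setup the abstract describes, the sketch below computes a CLIP-style symmetric contrastive loss in which the text embeddings are treated as fixed outputs of a frozen encoder. It is not the authors' implementation: the function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for this example, and it does not model prompt distillation or facet-decoupled attention.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE loss: matched pairs sit on the diagonal of the
    # similarity matrix. The temperature is a hypothetical default.
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: softmax over each column.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Assumption: these stand in for precomputed outputs of a frozen text encoder,
# so only the image side would receive gradient updates during training.
frozen_text_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_contrastive_loss([[0.9, 0.1], [0.1, 0.9]], frozen_text_emb)
shuffled = clip_contrastive_loss([[0.1, 0.9], [0.9, 0.1]], frozen_text_emb)
```

In this toy case the loss is lower when each image embedding points toward its own caption embedding than when the pairs are swapped, which is the signal a contrastive objective trains on.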
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Radiomics and Machine Learning in Medical Imaging
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
