ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu

TL;DR
ALIP is an adaptive pre-training method that supervises a vision-language model with both raw web text and OFA-generated synthetic captions, and dynamically reweights training samples to limit the effect of noisy or unmatched image-text pairs.
Contribution
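This work proposes ALIP, a bi-path pre-training framework that combines supervision from raw text and synthetic captions. Its two core components, the Language Consistency Gate (LCG) and the Description Consistency Gate (DCG), dynamically adjust the weights of samples and of image-text/caption pairs.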
Findings
Achieves state-of-the-art results on zero-shot image-text retrieval.
Effectively reduces noise impact during pre-training.
Enhances downstream task performance across various models and datasets.
Abstract
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the…
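The weighting idea in the abstract can be illustrated with a small sketch. The excerpt above is truncated and does not give the gate formulas, so everything here is an assumption made for illustration: `consistency_weight` is a hypothetical stand-in for the LCG and DCG, the threshold and temperature values are arbitrary, and embeddings are plain Python lists rather than encoder outputs. It shows only the general pattern of a bi-path contrastive loss in which each pair carries its own weight.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def consistency_weight(a, b, threshold=0.5):
    """Hypothetical gate (NOT the paper's formula): full weight when the two
    embeddings agree, exponentially reduced weight when they do not."""
    s = cosine(a, b)
    return 1.0 if s >= threshold else math.exp(s - threshold)


def weighted_contrastive_loss(xs, ys, weights, temperature=0.07):
    """Symmetric InfoNCE loss where pair i contributes with weights[i]."""
    n = len(xs)
    logits = [[cosine(x, y) / temperature for y in ys] for x in xs]
    total = 0.0
    for i in range(n):
        row = logits[i]                          # x_i against every y
        col = [logits[j][i] for j in range(n)]   # y_i against every x
        loss_x2y = log_sum_exp(row) - row[i]
        loss_y2x = log_sum_exp(col) - col[i]
        total += weights[i] * 0.5 * (loss_x2y + loss_y2x)
    return total / n


def bi_path_loss(images, texts, captions):
    """Sum of an image-text path and an image-caption path.

    Assumption: a per-sample weight comes from raw-text/caption agreement
    (stand-in for the LCG) and is scaled by a per-pair weight from
    image-text or image-caption agreement (stand-in for the DCG)."""
    w_sample = [consistency_weight(t, c) for t, c in zip(texts, captions)]
    w_text = [w * consistency_weight(i, t)
              for w, i, t in zip(w_sample, images, texts)]
    w_cap = [w * consistency_weight(i, c)
             for w, i, c in zip(w_sample, images, captions)]
    return (weighted_contrastive_loss(images, texts, w_text)
            + weighted_contrastive_loss(images, captions, w_cap))
```

In this sketch, a pair whose raw text disagrees with both the image and the synthetic caption receives a weight below 1 on the image-text path, so it contributes less to the total loss than it would under uniform weighting.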
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: OFA · Focus
