ALIP: Adaptive Language-Image Pre-training with Synthetic Caption

Kaicheng Yang; Jiankang Deng; Xiang An; Jiawei Li; Ziyong Feng; Jia Guo; Jing Yang; Tongliang Liu

arXiv:2308.08428·cs.CV·August 21, 2023·1 cites

TL;DR

ALIP introduces an adaptive pre-training method that utilizes synthetic captions and dynamic weighting mechanisms to improve vision-language representations, reducing noise impact and enhancing downstream task performance.

Contribution

This work proposes ALIP, a novel adaptive pre-training framework that integrates synthetic captions and dynamic sample weighting to improve vision-language models.

Findings

1. Achieves state-of-the-art results on zero-shot image-text retrieval.
2. Effectively reduces noise impact during pre-training.
3. Enhances downstream task performance across various models and datasets.

Abstract

Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the…
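The weighting idea in the abstract can be illustrated with a minimal sketch: a symmetric contrastive loss in which each sample's contribution is scaled by a consistency weight, applied once on the image/raw-text path and once on the image/synthetic-caption path. The abstract is truncated and does not give the gate formulas, so everything specific here is an assumption: the function names (consistency_weight, bi_path_loss), the clipped cosine similarity used as a stand-in gate, the way the weights are combined, and the temperature of 0.07. It is not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def weighted_contrastive_loss(imgs, txts, weights, temperature=0.07):
    """Symmetric InfoNCE loss, averaged with per-sample weights."""
    imgs = [normalize(v) for v in imgs]
    txts = [normalize(v) for v in txts]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    total = 0.0
    for k in range(n):
        i2t = cross_entropy_row(logits[k], k)                         # image -> text
        t2i = cross_entropy_row([logits[r][k] for r in range(n)], k)  # text -> image
        total += weights[k] * 0.5 * (i2t + t2i)
    return total / sum(weights)

def consistency_weight(a, b, floor=1e-6):
    """ASSUMPTION: stand-in gate, cosine similarity clipped to [floor, 1]."""
    return min(1.0, max(floor, dot(normalize(a), normalize(b))))

def bi_path_loss(imgs, raw_txts, syn_caps):
    """ASSUMPTION: one way to combine a sample weight with per-path pair weights."""
    # Sample weight: agreement between raw text and synthetic caption.
    w_sample = [consistency_weight(t, c) for t, c in zip(raw_txts, syn_caps)]
    # Pair weights: agreement between the image and each description.
    w_text = [w * consistency_weight(i, t) for w, i, t in zip(w_sample, imgs, raw_txts)]
    w_cap = [w * consistency_weight(i, c) for w, i, c in zip(w_sample, imgs, syn_caps)]
    return (weighted_contrastive_loss(imgs, raw_txts, w_text)
            + weighted_contrastive_loss(imgs, syn_caps, w_cap))
```

With this kind of weighting, a pair whose raw text disagrees with both the image and the synthetic caption receives a weight near the floor and contributes little to the batch loss, which is the noise-reduction effect the abstract describes in general terms.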

Code & Models

Repositories

deepglint/alip (PyTorch, official)

Videos

ALIP: Adaptive Language-Image Pre-Training with Synthetic Caption · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: OFA · Focus