CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Yanqing Liu; Xianhang Li; Zeyu Wang; Bingchen Zhao; Cihang Xie

arXiv:2411.16828·cs.CV·November 27, 2024

TL;DR

This paper introduces CLIPS, a vision-language pretraining framework that feeds only partial synthetic captions to the text encoder and trains an autoregressive captioner to predict the full-length synthetic caption. It improves zero-shot cross-modal retrieval and reports state-of-the-art results on MSCOCO and Flickr30K.

Contribution

The paper proposes two techniques for making better use of synthetic captions in vision-language pretraining: feeding only partial synthetic captions to the text encoder, and adding an autoregressive captioner that learns to predict the full-length synthetic caption.

Findings

01 Significant improvement in zero-shot cross-modal retrieval performance.

02 State-of-the-art results on MSCOCO and Flickr30K datasets.

03 Enhanced visual capabilities in LLaVA with the trained encoders.

Abstract

Previous work shows that noisy, web-crawled image-text pairs may limit vision-language pretraining like CLIP, and proposes learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. First, we observe a strong inverse effect in learning with synthetic captions: short synthetic captions generally lead to much higher performance than full-length ones. We therefore feed only partial synthetic captions to the text encoder. Second, we incorporate an autoregressive captioner to mimic the recaptioning process: conditioned on the paired image input and the web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot…
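The two designs in the abstract can be read as a data-preparation step: one image yields a shortened synthetic caption for the text encoder, plus a (web text, full synthetic caption) pair for the captioner. The sketch below is only an illustration under stated assumptions. The abstract does not say how the partial caption is chosen, so plain word-level truncation is assumed here, and the function names, dictionary keys, and `max_words` parameter are all hypothetical.

```python
from typing import Dict


def truncate_caption(caption: str, max_words: int) -> str:
    """Keep only the first `max_words` whitespace-separated words.

    Assumption: truncation is just one possible way to form a
    "partial" caption; the abstract does not specify the strategy.
    """
    words = caption.split()
    return " ".join(words[:max_words])


def build_training_example(
    image_id: str,
    web_text: str,
    synthetic_caption: str,
    max_words: int = 8,
) -> Dict[str, str]:
    """Assemble the text inputs for one image (hypothetical layout).

    - contrastive_text: partial synthetic caption for the text encoder
    - captioner_condition: web-crawled text given to the captioner
      (the captioner would also see the image itself)
    - captioner_target: full-length synthetic caption to predict
    """
    return {
        "image": image_id,
        "contrastive_text": truncate_caption(synthetic_caption, max_words),
        "captioner_condition": web_text,
        "captioner_target": synthetic_caption,
    }


example = build_training_example(
    image_id="img_0001",
    web_text="dog park photo",
    synthetic_caption=(
        "A brown dog runs across a grassy park while a child "
        "in a red jacket throws a yellow ball toward it."
    ),
    max_words=8,
)
print(example["contrastive_text"])
```

In this layout the text encoder only ever sees the shortened caption, while the full-length caption is used solely as the captioner's prediction target.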

Taxonomy

Topics: Subtitles and Audiovisual Media · Translation Studies and Practices · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training