CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

Pavan Kumar Anasosalu Vasu; Hadi Pouransari; Fartash Faghri; Oncel Tuzel

arXiv:2405.08911·cs.CV·May 16, 2024·2 cites

Open Access

TL;DR

Enhancing caption quality in image-text datasets significantly improves CLIP's visual representations, leading to superior performance on dense prediction tasks and increased data efficiency, surpassing recent pretraining methods.

Contribution

This work demonstrates that high-quality captions in pretraining datasets enhance CLIP's effectiveness on dense vision tasks, outperforming recent state-of-the-art pretraining approaches.

Findings

01 CLIP pretrained with quality captions surpasses recent pretraining methods on dense prediction tasks.

02 Improved caption quality yields 12.1% higher mIoU (semantic segmentation) and 11.5% lower RMSE (depth estimation).

03 Mobile architectures benefit similarly from CLIP pretraining, achieving competitive results.
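
The mIoU and RMSE figures above refer to standard metrics for dense prediction. As a minimal sketch of how these two metrics are commonly computed (assuming flattened per-pixel predictions; the function names `mean_iou` and `rmse` and the toy inputs are illustrative and not taken from the paper's evaluation code):

```python
import math

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

def rmse(pred, target):
    """Root-mean-square error between predicted and ground-truth values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

# Toy segmentation labels (4 pixels, 2 classes) and toy depth values (3 pixels).
seg_score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
depth_err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(f"mIoU = {seg_score:.4f}, RMSE = {depth_err:.4f}")
```

Higher mIoU is better and lower RMSE is better, which is why the two reported changes point in opposite directions. Real benchmarks typically accumulate intersections and unions over the whole dataset before averaging, which this sketch does not do.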

Abstract

CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good-quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains…
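
The abstract does not spell out the training objective. As a rough illustration of why image-text alignment matters, here is a minimal sketch of the standard symmetric contrastive loss used in CLIP-style training. It is written in plain Python and is not the paper's training code; `clip_loss`, the toy embeddings, and the fixed temperature of 0.07 are assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]     # caption i matches image i
misaligned_texts = [[0.0, 1.0], [1.0, 0.0]]  # captions swapped

print(clip_loss(images, aligned_texts) < clip_loss(images, misaligned_texts))
```

In this toy setup the loss is near zero when each caption embedding matches its image and large when captions are swapped. A poorly aligned caption gives a weaker training signal in the same way.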

Taxonomy

Topics: Subtitles and Audiovisual Media

Methods: Contrastive Language-Image Pre-training