Caption supervision enables robust learners

Benjamin Feuer; Ameya Joshi; Chinmay Hegde

arXiv:2210.07396 · cs.CV · December 9, 2022 · 1 citation

TL;DR

This paper shows that caption-supervised CNNs trained with a standard cross-entropy loss can be more distributionally robust than vision-language models such as CLIP trained on the same data, and it introduces CaptionNet, a new dataset for future caption-supervision research.

Contribution

It shows that caption-supervised CNNs can be more robust than VL models trained on the same data, and it provides a new dataset, CaptionNet, for advancing caption-supervision research.

Findings

01

Caption-supervised CNNs can be more distributionally robust than VL models trained on the same data.

02

Choice of loss function and supervision strategy impacts robustness.

03

CaptionNet is introduced, with over 50,000 new human-labeled, ImageNet-compliant samples and web-scraped captions.

Abstract

Vision-language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision, in which the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that caption-supervised CNNs trained on a standard cross-entropy loss (with image labels assigned by scanning captions for class names) can exhibit greater distributional robustness than VL models trained on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet (https://github.com/penfever/CaptionNet/), which includes a class-balanced, fully supervised dataset with over 50,000 new human-labeled, ImageNet-compliant samples, along with web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function,…
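The label-assignment step mentioned in the abstract (scanning captions for class names) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny `CLASS_NAMES` table, the helper names `build_patterns` and `label_from_caption`, the whole-word case-insensitive matching, and the choice to discard captions that match zero or several classes. The paper's actual matching rules are not described on this page and may differ.

```python
import re

# Hypothetical class table: class index -> names that count as a match.
# Not the paper's class list; chosen only to make the sketch runnable.
CLASS_NAMES = {
    0: ["goldfish"],
    1: ["tabby cat", "tabby"],
    2: ["golden retriever"],
}


def build_patterns(class_names):
    """Compile one whole-word, case-insensitive regex per class."""
    patterns = {}
    for idx, names in class_names.items():
        # Try longer names first so "tabby cat" is preferred over "tabby".
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        patterns[idx] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return patterns


def label_from_caption(caption, patterns):
    """Return the class index if exactly one class name appears, else None.

    Assumption: captions matching no class or several classes are treated
    as unlabeled rather than guessed.
    """
    matches = [idx for idx, pattern in patterns.items() if pattern.search(caption)]
    return matches[0] if len(matches) == 1 else None


PATTERNS = build_patterns(CLASS_NAMES)

if __name__ == "__main__":
    for caption in ["A goldfish in a bowl", "a dog in the park"]:
        print(caption, "->", label_from_caption(caption, PATTERNS))
```

Labels produced this way are ordinary integer class indices, so they can be fed to a standard cross-entropy classifier; captions that return `None` would simply be left out of the labeled training set under this sketch's assumptions.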

Code & Models

Repositories

penfever/vlhub (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training