CiT: Curation in Training for Effective Vision-Language Data

Hu Xu; Saining Xie; Po-Yao Huang; Licheng Yu; Russell Howes; Gargi Ghosh; Luke Zettlemoyer; Christoph Feichtenhofer

arXiv:2301.02241·cs.CV·January 6, 2023

TL;DR

CiT introduces an efficient training method for vision-language models that automatically curates high-quality data from large pools, significantly speeding up training without extensive offline filtering.

Contribution

The paper proposes a novel data curation algorithm integrated into training, reducing data filtering costs and enabling faster, scalable vision-language model training.

Findings

01 Speeds up training by over an order of magnitude.

02 Effectively utilizes raw web data without offline filtering.

03 Maintains competitive performance with less data filtering effort.

Abstract

Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternately selects relevant training data from the pool by measuring…
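The two-loop structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract is cut off before it names the selection measure, so the sketch assumes cosine similarity between caption embeddings and metadata embeddings with a fixed threshold. The bag-of-words `embed` function is a toy stand-in for the text encoder, and `curate`, `train_step`, `cit_sketch`, and the `threshold` and `rounds` parameters are all hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for the text encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def curate(pool, metadata, threshold):
    """Outer-loop step: keep pairs whose caption is close to any metadata entry."""
    meta_vecs = [embed(m) for m in metadata]
    kept = []
    for image_id, caption in pool:
        cap_vec = embed(caption)
        if max(cosine(cap_vec, m) for m in meta_vecs) >= threshold:
            kept.append((image_id, caption))
    return kept


def train_step(batch):
    """Inner-loop placeholder: a real step would run a contrastive update.

    Here it only reports how many pairs were consumed.
    """
    return len(batch)


def cit_sketch(pool, metadata, threshold=0.3, rounds=2):
    """Alternate between curating data and consuming the curated data."""
    consumed = 0
    curated = []
    for _ in range(rounds):
        curated = curate(pool, metadata, threshold)  # outer loop
        consumed += train_step(curated)              # inner loop
    return curated, consumed


# Hypothetical toy pool of (image id, caption) pairs and task metadata.
pool = [
    ("img0", "a photo of a golden retriever"),
    ("img1", "click here to subscribe"),
    ("img2", "tabby cat on a sofa"),
]
metadata = ["golden retriever", "tabby cat"]
curated, consumed = cit_sketch(pool, metadata)
```

In this toy version the encoder never changes, so every round curates the same pairs. In the method the abstract describes, the text encoder is shared between the two loops, so the curated set would be expected to shift as training proceeds.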

Code & Models

Repositories

facebookresearch/cit (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings