Curriculum Learning for Data-Efficient Vision-Language Alignment

Tejas Srinivasan; Xiang Ren; Jesse Thomason

arXiv:2207.14525 · cs.CV · August 1, 2022 · 1 citation

TL;DR

This paper introduces TOnICS, a curriculum learning method that efficiently aligns pre-trained vision and language models with minimal paired data, improving zero-shot image retrieval performance.

Contribution

It presents a novel curriculum learning approach for fine-grained vision-language alignment using pre-trained models and less data than traditional contrastive methods.

Findings

01

Outperforms CLIP on zero-shot image retrieval

02

Uses less than 1% as much training data as CLIP

03

Aligns individually pre-trained BERT and VinVL models using a much smaller amount of paired image-text data

Abstract

Aligning image and text encoders from scratch using contrastive learning requires large amounts of paired image-text data. We alleviate this need by aligning individually pre-trained language and vision representation models using a much smaller amount of paired data, augmented with a curriculum learning algorithm to learn fine-grained vision-language alignments. TOnICS (Training with Ontology-Informed Contrastive Sampling) initially samples minibatches whose image-text pairs contain a wide variety of objects to learn object-level alignment, and progressively samples minibatches where all image-text pairs contain the same object to learn finer-grained contextual alignment. Aligning pre-trained BERT and VinVL models to each other using TOnICS outperforms CLIP on downstream zero-shot image retrieval while using less than 1% as much training data.
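
The minibatch sampling idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-pair object labels, the function names (`group_by_object`, `sample_minibatch`, `linear_schedule`), and the linear schedule are all hypothetical. The abstract does not specify how the curriculum advances or how the ontology is used.

```python
import random
from collections import defaultdict


def group_by_object(pairs):
    """Group (image_id, caption, object_label) triples by object label."""
    groups = defaultdict(list)
    for image_id, caption, obj in pairs:
        groups[obj].append((image_id, caption, obj))
    return dict(groups)


def sample_minibatch(groups, batch_size, p_same_object, rng):
    """Sample one minibatch of image-text pairs.

    With probability p_same_object, every pair in the batch shares one
    object, so in-batch negatives differ only in context (harder).
    Otherwise each pair has a different object (easier, object-level).
    """
    if rng.random() < p_same_object:
        # Harder batch: needs an object with enough pairs to fill the batch.
        eligible = [o for o, items in groups.items() if len(items) >= batch_size]
        if eligible:
            obj = rng.choice(eligible)
            return rng.sample(groups[obj], batch_size)
        # No object has enough pairs: fall through to the easier batch.
    # Easier batch: one pair from each of up to batch_size distinct objects.
    objects = rng.sample(sorted(groups), min(batch_size, len(groups)))
    return [rng.choice(groups[o]) for o in objects]


def linear_schedule(step, total_steps):
    """Assumed schedule: p_same_object grows linearly from 0 to 1."""
    return min(1.0, step / max(1, total_steps))


if __name__ == "__main__":
    # Toy data: 4 objects with 4 captioned images each (made up for the demo).
    toy_pairs = [
        (f"img_{obj}_{i}", f"a photo of a {obj}, variant {i}", obj)
        for obj in ("dog", "cat", "car", "tree")
        for i in range(4)
    ]
    groups = group_by_object(toy_pairs)
    rng = random.Random(0)
    for step in (0, 5, 10):
        p = linear_schedule(step, total_steps=10)
        batch = sample_minibatch(groups, batch_size=4, p_same_object=p, rng=rng)
        print(step, p, sorted({obj for _, _, obj in batch}))
```

Early in training `p_same_object` is near 0, so batches mix objects and a contrastive loss only has to tell objects apart. As it approaches 1, the negatives in each batch share the positive's object, which pushes the model toward the finer-grained contextual alignment the abstract describes.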

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Dropout · Dense Connections · Softmax · Layer Normalization · Adam