Curriculum Learning for Data-Efficient Vision-Language Alignment
Tejas Srinivasan, Xiang Ren, Jesse Thomason

TL;DR
This paper introduces TOnICS, a curriculum learning method that efficiently aligns pre-trained vision and language models with minimal paired data, improving zero-shot image retrieval performance.
Contribution
It presents a novel curriculum learning approach for fine-grained vision-language alignment using pre-trained models and less data than traditional contrastive methods.
Findings
Outperforms CLIP on zero-shot image retrieval
Uses less than 1% as much training data as CLIP
Enables effective alignment with smaller datasets
Abstract
Aligning image and text encoders from scratch using contrastive learning requires large amounts of paired image-text data. We alleviate this need by aligning individually pre-trained language and vision representation models using a much smaller amount of paired data, augmented with a curriculum learning algorithm to learn fine-grained vision-language alignments. TOnICS (Training with Ontology-Informed Contrastive Sampling) initially samples minibatches whose image-text pairs contain a wide variety of objects to learn object-level alignment, and progressively samples minibatches where all image-text pairs contain the same object to learn finer-grained contextual alignment. Aligning pre-trained BERT and VinVL models to each other using TOnICS outperforms CLIP on downstream zero-shot image retrieval while using less than 1% as much training data.
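The curriculum described in the abstract can be illustrated with a minimal sampling sketch. It is not the authors' implementation: the `(image_id, caption, object)` tuple layout, the function names, and in particular the linear schedule that shifts from diverse-object minibatches to same-object minibatches are all assumptions made for illustration, since the abstract does not say how the transition is controlled.

```python
import random
from collections import defaultdict


def build_object_index(pairs):
    """Map each object label to the indices of the pairs that contain it.

    `pairs` is a list of (image_id, caption, object_label) tuples; this
    layout is a hypothetical stand-in for a real paired dataset.
    """
    index = defaultdict(list)
    for i, (_, _, obj) in enumerate(pairs):
        index[obj].append(i)
    return dict(index)


def same_object_schedule(step, total_steps):
    """Assumed linear schedule: probability of a same-object batch goes 0 -> 1."""
    return min(1.0, step / max(1, total_steps - 1))


def sample_minibatch(index, batch_size, same_object_prob, rng):
    """Return pair indices for one minibatch.

    With probability `same_object_prob`, every pair shares one object, so the
    in-batch negatives differ only in context (the finer-grained case).
    Otherwise each pair comes from a different object (object-level case).
    """
    if rng.random() < same_object_prob:
        # Only objects with enough pairs can fill a whole batch.
        eligible = [o for o, idxs in index.items() if len(idxs) >= batch_size]
        if eligible:
            obj = rng.choice(eligible)
            return rng.sample(index[obj], batch_size)
    # Diverse batch: pick distinct objects, then one pair per object.
    objects = rng.sample(sorted(index), min(batch_size, len(index)))
    return [rng.choice(index[o]) for o in objects]


if __name__ == "__main__":
    labels = ["dog", "cat", "car", "cup"] * 5
    pairs = [(f"img{i}", f"a photo of a {o}", o) for i, o in enumerate(labels)]
    index = build_object_index(pairs)
    rng = random.Random(0)
    total_steps = 5
    for step in range(total_steps):
        p = same_object_schedule(step, total_steps)
        batch = sample_minibatch(index, 4, p, rng)
        print(step, round(p, 2), [pairs[i][2] for i in batch])
```

Each sampled batch would then be fed to an ordinary in-batch contrastive loss. The curriculum changes only which negatives appear together, so late in training the model has to separate captions and images that mention the same object.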
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Dropout · Dense Connections · Softmax · Layer Normalization · Adam
