Data curation via joint example selection further accelerates multimodal learning

Talfan Evans; Nikhil Parthasarathy; Hamza Merzic; Olivier J. Henaff

arXiv:2406.17711·cs.LG·June 26, 2024·1 cites

TL;DR

This paper introduces JEST, a data curation method for multimodal contrastive learning that selects batches of examples jointly rather than scoring each example independently. By exploiting the dependencies between examples in a batch and leveraging pretrained models, it surpasses state-of-the-art models with up to 13× fewer iterations and 10× less computation.

Contribution

It proposes a novel joint data selection algorithm for multimodal contrastive learning, improving training efficiency and model performance with fewer iterations and less computation.

Findings

01. JEST surpasses state-of-the-art models with up to 13× fewer iterations.

02. JEST reduces computational costs by 10×.

03. Joint batch selection accelerates training beyond individual example prioritization.

Abstract

Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach--multimodal contrastive learning with joint example selection (JEST)--surpasses state-of-the-art models with up to 13× fewer iterations and 10× less computation. Essential to…
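The idea of scoring a batch jointly, rather than example by example, can be illustrated with a small sketch. The code below is not the paper's algorithm: it assumes two hypothetical image-text similarity matrices over a super-batch (`learner_sim` from the model being trained, `reference_sim` from a pretrained model), assumes a learner-minus-reference loss as the batch score (the abstract text above does not state the exact criterion), and swaps in a simple greedy chunked selection. All function names are illustrative.

```python
import math


def contrastive_loss(sim, batch):
    """Summed symmetric softmax contrastive loss over `batch`.

    `sim[i][j]` is an assumed similarity between image i and text j.
    Each example's loss depends on every other example in the batch,
    which is what makes the score a property of the batch as a whole.
    """
    total = 0.0
    for i in batch:
        img_to_txt = math.log(sum(math.exp(sim[i][j]) for j in batch))
        txt_to_img = math.log(sum(math.exp(sim[j][i]) for j in batch))
        total += 0.5 * ((img_to_txt - sim[i][i]) + (txt_to_img - sim[i][i]))
    return total


def joint_learnability(learner_sim, reference_sim, batch):
    """Assumed batch score: high learner loss, low reference-model loss."""
    return contrastive_loss(learner_sim, batch) - contrastive_loss(reference_sim, batch)


def select_batch(learner_sim, reference_sim, batch_size, chunk_size):
    """Greedily grow a batch from the super-batch, `chunk_size` examples at a time.

    Candidates are ranked by the joint score of (already chosen + candidate),
    so each choice is conditioned on the examples selected so far.
    """
    n = len(learner_sim)
    batch_size = min(batch_size, n)  # cannot select more than the super-batch holds
    chosen = []
    while len(chosen) < batch_size:
        remaining = [i for i in range(n) if i not in chosen]
        scores = {
            i: joint_learnability(learner_sim, reference_sim, chosen + [i])
            for i in remaining
        }
        take = min(chunk_size, batch_size - len(chosen))
        best = sorted(remaining, key=lambda i: scores[i], reverse=True)[:take]
        chosen.extend(best)
    return chosen
```

Calling `select_batch(learner_sim, reference_sim, batch_size=2, chunk_size=1)` on two 4×4 similarity matrices returns two distinct indices into the super-batch. Larger chunks need fewer scoring rounds but condition each choice on less of the final batch.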

Videos

Data curation via joint example selection further accelerates multimodal learning · SlidesLive

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies

Methods: Contrastive Learning