Curriculum Learning with Quality-Driven Data Selection

Biao Wu; Ling Chen

arXiv:2407.00102 · cs.LG · June 3, 2025

TL;DR

This paper introduces a novel curriculum learning approach for multimodal large language models that uses image-text correlation and model perplexity to select high-quality data, improving model capabilities efficiently.

Contribution

It proposes a new data selection method based on a two-dimensional quality space, enabling better control and curriculum learning in multimodal model training.
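One way to picture a two-dimensional quality space is to bin every image-text pair along both axes at once. The sketch below is illustrative only and is not taken from the paper: the function names `quantile_edges` and `assign_cells`, the equal-count binning, and the example scores are all assumptions.

```python
from bisect import bisect_right


def quantile_edges(values, bins):
    """Return bins-1 cut points that split the values into roughly equal-count bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // bins] for k in range(1, bins)]


def assign_cells(correlation, perplexity, bins=2):
    """Map each sample to a (correlation_bin, perplexity_bin) cell of the grid."""
    c_edges = quantile_edges(correlation, bins)
    p_edges = quantile_edges(perplexity, bins)
    return [
        (bisect_right(c_edges, c), bisect_right(p_edges, p))
        for c, p in zip(correlation, perplexity)
    ]


# Hypothetical scores for four image-text pairs (not data from the paper).
correlation = [0.10, 0.35, 0.60, 0.90]
perplexity = [12.0, 3.0, 8.0, 5.0]
cells = assign_cells(correlation, perplexity, bins=2)
print(cells)
```

Each sample ends up with a pair of bin indices, so a subset can be chosen by picking grid cells rather than by thresholding a single score.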

Findings

01. Significant improvements in five key capabilities over baseline datasets.

02. Effective data quality evaluation using image-text correlation and perplexity.

03. Enhanced training efficiency through multi-stage data subsets.
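The idea of multi-stage data subsets can be sketched as splitting a scored dataset into consecutive training stages. This is a minimal illustration under stated assumptions: the single `quality` field, the helper `build_stages`, the number of stages, and the low-to-high ordering are all hypothetical, and the summary above does not say how the paper orders or sizes its stages.

```python
def build_stages(samples, score, num_stages=3, ascending=True):
    """Sort samples by a score and cut them into consecutive, near-equal stages."""
    ranked = sorted(samples, key=score, reverse=not ascending)
    size, extra = divmod(len(ranked), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        # The first `extra` stages take one additional sample each.
        end = start + size + (1 if i < extra else 0)
        stages.append(ranked[start:end])
        start = end
    return stages


# Hypothetical samples with an assumed composite quality score.
samples = [
    {"id": "a", "quality": 0.2},
    {"id": "b", "quality": 0.9},
    {"id": "c", "quality": 0.5},
    {"id": "d", "quality": 0.1},
    {"id": "e", "quality": 0.7},
    {"id": "f", "quality": 0.4},
    {"id": "g", "quality": 0.8},
]
stages = build_stages(samples, score=lambda s: s["quality"], num_stages=3)
stage_ids = [[s["id"] for s in stage] for stage in stages]
print(stage_ids)
```

A training loop would then iterate over `stages` in order; passing `ascending=False` reverses the schedule if the opposite ordering is wanted.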

Abstract

The impressive multimodal capabilities demonstrated by OpenAI's GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the instruction data. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distribution of these two…
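For readers unfamiliar with the perplexity signal mentioned in the abstract, the standard definition is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities; the function name and the example values are assumptions, and how the paper obtains these log-probabilities from an MLLM is not described in this excerpt.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the average negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical example: a model that assigns probability 0.25 to every token.
uniform = [math.log(0.25)] * 4
ppl = perplexity(uniform)
print(ppl)
```

Lower values mean the model finds the text more predictable, which is why perplexity can serve as one axis of a data quality score.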

Taxonomy

Topics: Machine Learning and Algorithms