Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric

Yuming Yang; Yang Nan; Junjie Ye; Shihan Dou; Xiao Wang; Shuo Li; Huijie Lv; Mingqi Wu; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2502.17184·cs.CL·June 3, 2025

TL;DR

This paper systematically evaluates existing diversity metrics for instruction tuning of language models, introduces a new metric called NovelSum based on sample novelty, and demonstrates its effectiveness in guiding data selection to improve model performance.

Contribution

The paper provides a comprehensive analysis of diversity measurement methods, proposes a new metric, NovelSum, and develops a diversity-oriented data selection strategy that outperforms existing methods.

Findings

01

NovelSum correlates with model performance at 0.97

02

Diversity-oriented selection improves instruction tuning results

03

Properly measuring diversity enhances data engineering for language models

Abstract

Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by evaluating their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level "novelty." Experiments on both simulated and real-world data show that NovelSum accurately…
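To make the idea of a sample-level "novelty" score concrete, here is a minimal sketch in Python. It is not the paper's formulation, which this page does not reproduce. It assumes each sample is already an embedding vector, takes a sample's novelty to be a proximity-weighted average of its cosine distances to the other samples, and sums those values over the dataset. The function name `novelty_sum`, the `alpha` parameter, and the rank-based weights are all hypothetical choices made for illustration. The sketch also leaves out the information-density term mentioned in the abstract, because its form is not given here.

```python
import numpy as np


def novelty_sum(embeddings, alpha=1.0):
    """Toy novelty-style diversity score (illustrative only, not the paper's metric).

    embeddings: array-like of shape (n_samples, dim).
    alpha: hypothetical exponent controlling how strongly near neighbours dominate.
    """
    x = np.asarray(embeddings, dtype=float)
    n = x.shape[0]
    if n < 2:
        return 0.0  # a single sample has nothing to be novel against

    # Pairwise cosine distance between all samples.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    dist = 1.0 - unit @ unit.T

    # Push self-distances to the end of each sorted row, then drop them.
    np.fill_diagonal(dist, np.inf)
    sorted_dist = np.sort(dist, axis=1)[:, : n - 1]

    # Assumed proximity weights: the j-th nearest neighbour gets weight 1 / j**alpha.
    ranks = np.arange(1, n)
    weights = 1.0 / ranks ** alpha
    weights = weights / weights.sum()

    # Per-sample novelty, then the dataset-level sum.
    novelty = sorted_dist @ weights
    return float(novelty.sum())


if __name__ == "__main__":
    clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01]]
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print("clustered:", novelty_sum(clustered))
    print("spread:   ", novelty_sum(spread))
```

Under these assumptions, a set of near-duplicate vectors scores close to zero, while vectors pointing in different directions score higher. This matches the intuition that a diversity measure should reward inter-sample differences.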

Code & Models

Repositories

umeannever/novelsum
PyTorch · Official

Datasets

Sirius518/NovelSum
dataset · 754 downloads

Videos

Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric · underline

Taxonomy

Topics: Machine Learning and Data Classification · Data Quality and Management · Natural Language Processing Techniques