Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric
Yuming Yang, Yang Nan, Junjie Ye, Shihan Dou, Xiao Wang, Shuo Li, Huijie Lv, Mingqi Wu, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
This paper systematically evaluates existing diversity metrics for instruction tuning of language models, introduces a new metric called NovelSum based on sample novelty, and demonstrates its effectiveness in guiding data selection to improve model performance.
Contribution
The paper analyzes existing diversity measurement methods by their correlation with model performance, proposes a new metric, NovelSum, and develops a diversity-oriented data selection strategy that outperforms existing methods.
Findings
NovelSum's correlation with model performance reaches 0.97
Diversity-oriented selection improves instruction tuning results
Properly measuring diversity enhances data engineering for language models
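As a rough illustration of what a correlation figure like the one above measures, the sketch below computes a Pearson correlation between a diversity score and a benchmark score across several fine-tuning runs. The choice of Pearson correlation, the function name, and every number in the example are assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

def metric_performance_correlation(diversity_scores, performance_scores):
    """Pearson correlation between a diversity metric and model performance.

    Each position i is one fine-tuning run: diversity_scores[i] is the metric
    value of the training set, performance_scores[i] is the benchmark score of
    the model trained on it. (Hypothetical helper, not the authors' code.)
    """
    d = np.asarray(diversity_scores, dtype=float)
    p = np.asarray(performance_scores, dtype=float)
    if d.shape != p.shape or d.ndim != 1 or d.size < 2:
        raise ValueError("need two 1-D sequences of equal length >= 2")
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal.
    return float(np.corrcoef(d, p)[0, 1])

# Made-up placeholder values, chosen only so the example runs.
toy_diversity = [1.0, 2.0, 3.0, 4.0]
toy_performance = [2.0, 4.0, 6.0, 8.0]
r = metric_performance_correlation(toy_diversity, toy_performance)
```

A value near 1 would mean that datasets the metric rates as more diverse also tend to yield better models, which is the property the page's summary attributes to NovelSum.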
Abstract
Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by evaluating their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level "novelty." Experiments on both simulated and real-world data show that NovelSum accurately…
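To make the idea of a sample-level novelty score concrete, here is a deliberately simplified sketch. It is not the NovelSum definition, which this page does not give. It assumes each sample is an embedding vector, treats a sample's novelty as its distance to the nearest other sample in the dataset, and allows an optional per-sample weight as a stand-in for information density. The function name, the nearest-neighbor choice, and the weighting scheme are all hypothetical.

```python
import numpy as np

def toy_novelty_sum(embeddings, density_weights=None):
    """Toy diversity score: sum of per-sample novelty values.

    Novelty of a sample is taken here to be its Euclidean distance to the
    nearest other sample (an assumption for illustration). density_weights,
    if given, scales each sample's novelty; how such weights would be
    estimated is left open and is also an assumption.
    """
    x = np.asarray(embeddings, dtype=float)
    if x.ndim != 2 or x.shape[0] < 2:
        raise ValueError("need at least two samples as a 2-D array")
    # Pairwise Euclidean distances between all samples.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude each sample's zero distance to itself.
    np.fill_diagonal(dist, np.inf)
    novelty = dist.min(axis=1)
    if density_weights is not None:
        novelty = novelty * np.asarray(density_weights, dtype=float)
    return float(novelty.sum())

# Tightly clustered samples versus spread-out samples (toy 2-D points).
clustered = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0]]
spread = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
score_clustered = toy_novelty_sum(clustered)
score_spread = toy_novelty_sum(spread)
```

In this toy version, near-duplicate samples contribute almost nothing to the total, so the spread-out set scores higher than the clustered one.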
Taxonomy
Topics: Machine Learning and Data Classification · Data Quality and Management · Natural Language Processing Techniques
