From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning

Haoyu Li; Xuhong Li; Yiming Dong; Kun Liu

arXiv:2505.24768 · cs.CL · June 2, 2025

TL;DR

This paper systematically analyzes dataset diversity at macro, meso, and micro levels in language model fine-tuning, revealing how different diversity strategies impact model performance and offering guidance for constructing effective datasets.

Contribution

It introduces a taxonomy of diversity-control strategies across multiple levels and provides empirical evidence on their effects in LLM fine-tuning.

Findings

01 Microscopic diversity in responses correlates strongly with performance.

02 Maximum diversity across strategies yields the best model performance.

03 Macro- and mesoscopic strategies improve performance with increased diversity.

Abstract

Dataset diversity plays a pivotal role in the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, systematic analyses of dataset diversity remain scarce. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component and operate at either the macroscopic level (entire instruction semantics) or the mesoscopic level (instruction units). It furthermore introduces a novel analysis of microscopic diversity within the response component, specifically analyzing the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT…
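To make "microscopic diversity" concrete, here is a minimal sketch of one way to summarize the statistical distribution of tokens across a set of responses. The excerpt above does not say which statistic the paper uses, so the Shannon entropy and distinct-token ratio below are assumptions chosen for illustration, as is the whitespace tokenizer; the function names are hypothetical.

```python
import math
from collections import Counter


def token_entropy(responses):
    """Shannon entropy (in bits) of the pooled token distribution.

    Assumption: tokens are whitespace-separated words, not the
    subword tokens an actual LLM tokenizer would produce.
    """
    counts = Counter(tok for r in responses for tok in r.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_ratio(responses):
    """Fraction of pooled tokens that are unique (illustrative proxy)."""
    tokens = [tok for r in responses for tok in r.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    # Hypothetical toy responses, not data from the paper.
    repetitive = ["the answer is yes", "the answer is yes"]
    varied = ["the answer is yes", "no further steps are needed"]
    print(token_entropy(repetitive), token_entropy(varied))
    print(distinct_ratio(repetitive), distinct_ratio(varied))
```

Under these assumptions, a response set that reuses the same few tokens scores lower on both measures than one with a flatter token distribution; swapping in a real tokenizer would change the numbers but not the idea.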

Taxonomy

Methods: Shrink and Fine-Tune · Focus