TL;DR
This paper systematically analyzes dataset diversity at macro, meso, and micro levels in language model fine-tuning, revealing how different diversity strategies impact model performance and offering guidance for constructing effective datasets.
Contribution
It introduces a taxonomy of diversity-control strategies across multiple levels and provides empirical evidence on their effects in LLM fine-tuning.
Findings
Microscopic diversity in responses correlates strongly with performance.
Maximum diversity across strategies yields the best model performance.
Macro- and mesoscopic strategies improve performance with increased diversity.
Abstract
Dataset diversity plays a pivotal role in the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, systematic analyses of dataset diversity remain scarce. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component and operate at either the macroscopic level (entire instruction semantics) or the mesoscopic level (instruction units). It also introduces a novel analysis of microscopic diversity within the response component, specifically the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT…
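To make "microscopic diversity" concrete, the sketch below computes two common token-level statistics over a set of responses: the Shannon entropy of the pooled token distribution and the distinct-n ratio. The text above does not say which statistic the paper uses, so both measures, the whitespace tokenizer, and the function names `token_entropy` and `distinct_n` are illustrative assumptions rather than the authors' method.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for a real LLM tokenizer.
    return text.lower().split()


def token_entropy(responses):
    """Shannon entropy (in bits) of the token distribution pooled over all responses."""
    counts = Counter(tok for r in responses for tok in tokenize(r))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_n(responses, n=1):
    """Fraction of n-grams that are unique, pooled over all responses."""
    ngrams = []
    for r in responses:
        toks = tokenize(r)
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    # Hypothetical responses, not taken from the paper's corpus.
    repetitive = ["the answer is yes", "the answer is no"]
    varied = ["photosynthesis converts light", "tectonic plates drift slowly"]
    print(token_entropy(repetitive), token_entropy(varied))
    print(distinct_n(repetitive), distinct_n(varied))
```

Higher values on either measure indicate a flatter, less repetitive token distribution. Comparing such scores across candidate fixed-size subsets is one plausible way to rank them by response-level diversity.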
Taxonomy
Methods: Shrink and Fine-Tune · Focus
