Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, Chang Zhou

TL;DR
This paper introduces DiverseEvol, a self-evolving data sampling method that enables LLMs to efficiently select diverse training data subsets, reducing annotation costs while maintaining or improving instruction-following performance.
Contribution
The paper presents a novel self-evolving data sampling technique that enhances diversity in instruction data, allowing models to effectively fine-tune on significantly less data without human intervention.
Findings
Models trained on less than 8% of the original dataset match or outperform models fine-tuned on the full data.
Diversity in data sampling improves instruction-following ability.
Iterative sampling outperforms one-time data selection.
Abstract
Enhancing the instruction-following ability of Large Language Models (LLMs) primarily demands substantial instruction-tuning datasets. However, the sheer volume of these imposes a considerable computational burden and annotation cost. To investigate a label-efficient instruction tuning method that allows the model itself to actively sample subsets that are equally or even more effective, we introduce a self-evolving mechanism DiverseEvol. In this process, a model iteratively augments its training subset to refine its own performance, without requiring any intervention from humans or more advanced LLMs. The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets, as the model selects new data points most distinct from any existing ones according to its current embedding space. Extensive experiments across three datasets and benchmarks demonstrate the…
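The selection rule described in the abstract, picking new data points that are most distinct from those already chosen, can be illustrated with greedy farthest-point (k-center style) selection. The following is a minimal sketch, not the authors' implementation: the function name `farthest_point_select`, the use of Euclidean distance, and the toy 2-D embeddings are all assumptions made for illustration. In practice the embeddings would come from the model being trained.

```python
import math


def farthest_point_select(embeddings, selected, k):
    """Greedily pick up to k new indices (hypothetical helper, for illustration).

    Each pick is the point whose distance to its nearest already-chosen
    point is largest. Euclidean distance is an assumption of this sketch.
    """
    if not selected:
        raise ValueError("need at least one seed index")
    chosen = set(selected)
    # min_dist[i]: distance from point i to its nearest chosen point.
    # Chosen points get -1.0 so they can never be picked again.
    min_dist = [
        -1.0 if i in chosen else min(math.dist(e, embeddings[j]) for j in chosen)
        for i, e in enumerate(embeddings)
    ]
    new = []
    for _ in range(min(k, len(embeddings) - len(chosen))):
        idx = max(range(len(embeddings)), key=lambda i: min_dist[i])
        new.append(idx)
        chosen.add(idx)
        # Shrink each distance if the new point is closer than the old nearest.
        min_dist = [
            min(d, math.dist(e, embeddings[idx]))
            for d, e in zip(min_dist, embeddings)
        ]
        min_dist[idx] = -1.0
    return new


# Toy 2-D "embeddings"; index 0 is the seed subset.
points = [(0, 0), (1, 0), (10, 0), (0, 5), (0.5, 0)]
picked = farthest_point_select(points, selected=[0], k=2)
print(picked)  # [2, 3]
```

In the iterative setting the abstract describes, a step like this would be repeated each round, with the embeddings recomputed from the model after it has been retrained on the enlarged subset.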
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning and Data Classification
