Rethinking Data Selection at Scale: Random Selection is Almost All You Need
Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin

TL;DR
This paper demonstrates that for large-scale supervised fine-tuning of LLMs, simple random data selection often matches or exceeds complex methods, with token length filtering providing a stable improvement, especially for weaker models.
Contribution
The study shows that random selection is nearly as effective as advanced methods for large datasets and highlights the importance of data diversity and token length filtering in SFT.
Findings
Random selection performs comparably to complex methods on 2 million data points.
Diversity in data selection outweighs data quality in large-scale SFT.
Filtering data by token length improves fine-tuning results, especially for weaker models.
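The two baselines named in the findings, random selection and token length filtering, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (`instruction`/`response` keys), the function names, the whitespace-based token count (a stand-in for a real model tokenizer), and the choice to keep responses at or above a minimum length are all hypothetical, since the summary above does not specify them.

```python
import random
from typing import Callable, Dict, List

# Hypothetical record layout: {"instruction": "...", "response": "..."}
Example = Dict[str, str]


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def random_select(pool: List[Example], k: int, seed: int = 0) -> List[Example]:
    """Draw up to k examples uniformly at random, without replacement."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    return rng.sample(pool, min(k, len(pool)))


def select_by_token_length(
    pool: List[Example],
    k: int,
    min_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    seed: int = 0,
) -> List[Example]:
    """Keep examples whose response has at least min_tokens tokens,
    then sample up to k of them at random.

    Assumption: filtering keeps the longer responses; the threshold and its
    direction are illustrative and not taken from the paper summary.
    """
    kept = [ex for ex in pool if count_tokens(ex["response"]) >= min_tokens]
    return random_select(kept, k, seed)


if __name__ == "__main__":
    # Toy pool: example i has a response of exactly i words.
    pool = [{"instruction": f"q{i}", "response": "word " * i} for i in range(10)]
    print(len(random_select(pool, 3, seed=1)))
    print(len(select_by_token_length(pool, 100, min_tokens=5)))
```

In practice `count_tokens` would be replaced by the tokenizer of the model being fine-tuned, and `k` and `min_tokens` would be tuned to the training budget.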
Abstract
Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from the larger pool, such that fine-tuning with this subset achieves results comparable to or even exceeding those obtained using the entire dataset. However, most existing data selection techniques are designed for small-scale data pools, which fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods (those that do not rely on external model assistance) on two-million-scale datasets, and found that nearly all methods struggled to significantly outperform random selection when dealing with such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on…
Taxonomy
Topics: Big Data and Business Intelligence · Statistics Education and Methodologies · Machine Learning and Data Classification
Methods: Balanced Selection · Shrink and Fine-Tune
