Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Tingyu Xia; Bowen Yu; Kai Dang; An Yang; Yuan Wu; Yuan Tian; Yi Chang; Junyang Lin

arXiv:2410.09335 · cs.CL · December 10, 2024 · 2 cites

TL;DR

This paper demonstrates that for large-scale supervised fine-tuning of LLMs, simple random data selection often matches or exceeds complex methods, with token length filtering providing a stable improvement, especially for weaker models.

Contribution

The study shows that random selection is nearly as effective as advanced methods for large datasets and highlights the importance of data diversity and token length filtering in SFT.

Findings

01 Random selection performs comparably to complex methods on 2 million data points.

02 Diversity in data selection outweighs data quality in large-scale SFT.

03 Filtering data by token length improves fine-tuning results, especially for weaker models.
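
The two baselines named in the findings, random selection and token-length filtering, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the `instruction`/`response` field names, the whitespace split standing in for a real tokenizer, and the choice to keep the longest samples (this summary does not say which direction the length filter takes).

```python
import random


def count_tokens(text):
    # Assumption: whitespace splitting stands in for the model's tokenizer.
    return len(text.split())


def sample_length(example):
    # Assumption: length is measured over instruction plus response.
    return count_tokens(example["instruction"]) + count_tokens(example["response"])


def random_select(pool, k, seed=0):
    # Uniform sampling without replacement; the seed makes the subset reproducible.
    rng = random.Random(seed)
    return rng.sample(pool, k)


def select_by_token_length(pool, k):
    # Assumption: keep the k longest samples. sorted() is stable, so ties
    # keep their original pool order.
    return sorted(pool, key=sample_length, reverse=True)[:k]
```

Both functions take a list of dictionaries and a budget `k`, so either subset can be passed to the same fine-tuning pipeline for a like-for-like comparison.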

Abstract

Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from the larger pool, such that fine-tuning with this subset achieves results comparable to or even exceeding those obtained using the entire dataset. However, most existing data selection techniques are designed for small-scale data pools and fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods, those that do not rely on external model assistance, on two-million-scale datasets, and found that nearly all methods struggled to significantly outperform random selection when dealing with such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on…

Code & Models

Repositories

xiatingyu/sft-dataselection-at-scale
PyTorch · Official

Videos

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Taxonomy

Topics: Big Data and Business Intelligence · Statistics Education and Methodologies · Machine Learning and Data Classification

Methods: Balanced Selection · Shrink and Fine-Tune