SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman

TL;DR
The paper introduces SmallToLarge (S2L), a scalable data selection method that uses the training trajectories of small models to select fine-tuning data for large language models. On mathematical problem-solving, S2L matches full-dataset performance with 11% of the MathInstruct data.
Contribution
S2L is a novel, scalable data selection approach that leverages small models' training trajectories to improve data efficiency in fine-tuning large language models across domains.
Findings
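Reduces the training data for mathematical problem-solving to 11% of the original MathInstruct dataset while matching full-dataset performance.
Outperforms state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets.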
Achieves 32.7% accuracy on MATH benchmark using only 50K data points.
Abstract
Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets.…
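The abstract does not spell out the selection procedure, so the following is only a rough Python sketch of how trajectory-based selection could work. It assumes (none of this is stated in the text above) that each training example's loss is recorded at several checkpoints of a small proxy model, that examples are grouped by k-means on those loss trajectories, and that the data budget is spread evenly across the resulting clusters. The function names `kmeans` and `select_balanced` and the toy trajectories are hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_balanced(labels, budget, seed=0):
    """Pick up to `budget` example indices, spread evenly across clusters."""
    rng = random.Random(seed)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    selected = []
    remaining = budget
    # Visit small clusters first so their unused quota rolls over to larger ones.
    ordered = sorted(clusters.values(), key=len)
    for pos, members in enumerate(ordered):
        quota = remaining // (len(ordered) - pos)
        take = min(quota, len(members))
        selected.extend(rng.sample(members, take))
        remaining -= take
    return sorted(selected)


# Toy data (assumed): per-example losses at three checkpoints of a small model.
# Indices 0-2 are learned quickly; indices 3-11 stay at a high loss.
trajectories = [[2.0, 1.0, 0.5], [2.1, 1.1, 0.4], [1.9, 0.9, 0.6]]
trajectories += [[2.0 + 0.01 * i, 2.0, 2.0 - 0.01 * i] for i in range(9)]

labels = kmeans(trajectories, k=2)
chosen = select_balanced(labels, budget=4)
print(chosen)
```

In this sketch the selected indices would then define the subset used to fine-tune the larger model. Visiting small clusters first is a design choice of the sketch: it keeps rare trajectory patterns represented even under a small budget.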
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Shrink and Fine-Tune
