SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Yu Yang; Siddhartha Mishra; Jeffrey N Chiang; Baharan Mirzasoleiman

arXiv:2403.07384 · cs.CL · December 6, 2024 · 1 citation

TL;DR

The paper introduces SmallToLarge (S2L), a scalable data selection method that uses small model training trajectories to efficiently select training data for large language models, significantly reducing data requirements while maintaining or improving performance.

Contribution

S2L is a novel, scalable data selection approach that leverages small models' training trajectories to improve data efficiency in fine-tuning large language models across domains.

Findings

01 Matches full-dataset performance on mathematical problem-solving while training on just 11% of the original MathInstruct dataset.

02 Outperforms state-of-the-art data selection algorithms by an average of 4.7% across 6 in-domain and out-of-domain evaluation datasets.

03 Achieves 32.7% accuracy on the MATH benchmark using only 50K data points.

Abstract

Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets.…
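
The abstract says S2L uses small-model training trajectories to guide selection but does not spell out how they are summarized. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes each example's trajectory is its loss recorded at a few checkpoints of a small model, that trajectories are grouped with k-means, and that the data budget is spread evenly across the groups. The function names `kmeans` and `select_balanced`, the synthetic loss curves, and all parameter values are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every trajectory to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def select_balanced(labels, budget, seed=0):
    """Pick `budget` example indices, split as evenly as possible over clusters."""
    rng = np.random.default_rng(seed)
    clusters = [rng.permutation(np.flatnonzero(labels == c))
                for c in np.unique(labels)]
    clusters.sort(key=len)  # smallest first, so unused quota rolls forward
    selected, remaining = [], budget
    for i, members in enumerate(clusters):
        share = remaining // (len(clusters) - i)
        take = min(share, len(members))
        selected.extend(members[:take].tolist())
        remaining -= take
    return np.array(sorted(selected), dtype=int)


# Synthetic stand-in for per-example losses logged at 5 small-model checkpoints
# (assumption: real trajectories would come from training a small proxy model).
rng = np.random.default_rng(0)
steps = np.linspace(0.0, 1.0, 5)
fast = 2.0 * np.exp(-4.0 * steps) + 0.05 * rng.standard_normal((200, 5))
flat = 2.0 - 0.2 * steps + 0.05 * rng.standard_normal((80, 5))
rising = 0.5 + 1.0 * steps + 0.05 * rng.standard_normal((20, 5))
trajectories = np.vstack([fast, flat, rising])

labels = kmeans(trajectories, k=3)
selected = select_balanced(labels, budget=60)
print(len(selected), "examples selected for fine-tuning the larger model")
```

Under these assumptions, sampling evenly across clusters keeps rare learning patterns represented in the subset instead of letting the most common pattern dominate; the selected indices would then be used to fine-tune the larger model.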

Code & Models

Repositories

bigml-cs-ucla/s2l · PyTorch · Official

Videos

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Shrink and Fine-Tune