Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Feiyang Kang; Hoang Anh Just; Yifan Sun; Himanshu Jahagirdar; Yuanzhi Zhang; Rongxing Du; Anit Kumar Sahu; Ruoxi Jia

arXiv:2405.02774 · cs.LG · May 7, 2024 · 3 cites

TL;DR

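This paper introduces a data selection method for pre-fine-tuning large language models. Rather than choosing data that merely resembles the target, it chooses data that moves the pre-training distribution closer to the target distribution, which reduces the amount of costly domain-specific data needed and improves performance.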

Contribution

It proposes a novel data selection approach focused on distribution alignment for pre-fine-tuning, outperforming existing methods in efficiency and effectiveness.

Findings

1. Outperforms other selection methods across multiple tasks and models.

2. Runs significantly faster than existing selection methods, scaling to millions of samples within an hour.

3. Makes fine-tuning large language models more cost-effective.

Abstract

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks…
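The contrast drawn in the abstract, between selecting data that matches the target and selecting data that moves the pre-training distribution toward the target, can be illustrated with a toy sketch. This is not the paper's algorithm. It assumes data can be summarized as a categorical distribution over three hypothetical topics, that adding selected data acts as a simple mixture with weight `lam`, and that closeness is measured by total variation distance. All names and numbers below are illustrative assumptions.

```python
# Toy illustration only; NOT the authors' method.
# Assumption: distributions are categorical over three hypothetical topics.

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mix(p, s, lam):
    """Effective distribution after blending a fraction `lam` of selected data `s` into `p`."""
    return [(1 - lam) * a + lam * b for a, b in zip(p, s)]

def select_matching_target(p, q):
    """Baseline: the selected data simply mirrors the target distribution."""
    return list(q)

def select_nudging(p, q):
    """Select only from topics that `p` under-represents relative to `q`."""
    gap = [max(b - a, 0.0) for a, b in zip(p, q)]
    total = sum(gap)
    # If p already covers q everywhere, fall back to the target itself.
    return [g / total for g in gap] if total > 0 else list(q)

pretrain = [0.7, 0.2, 0.1]   # hypothetical pre-training topic mix
target = [0.2, 0.3, 0.5]     # hypothetical target topic mix
lam = 0.2                    # hypothetical share of selected data

tv_before = total_variation(pretrain, target)
tv_match = total_variation(
    mix(pretrain, select_matching_target(pretrain, target), lam), target)
tv_nudge = total_variation(
    mix(pretrain, select_nudging(pretrain, target), lam), target)

print(f"before: {tv_before:.2f}, match target: {tv_match:.2f}, nudge: {tv_nudge:.2f}")
```

In this toy setting the nudging selection leaves the blended distribution closer to the target than the target-matching selection does, because none of the selection budget is spent on topics the pre-training data already over-represents.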

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning and Data Classification · Mineral Processing and Grinding