Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Ziche Liu; Rui Ke; Yajiao Liu; Feng Jiang; Haizhou Li

arXiv:2406.14115 · cs.CL · February 25, 2025

TL;DR

This paper reviews recent data selection techniques for fine-tuning large language models, proposing a unified framework and comparison metrics to evaluate their efficiency and feasibility, and discusses future research challenges.

Contribution

It introduces a three-stage scheme for categorizing data selection methods and a unified comparison approach addressing experimental inconsistencies.

Findings

01. Targeted quality measurement improves efficiency.

02. Methods show a trade-off between efficiency and feasibility.

03. The paper identifies key challenges and future directions.

Abstract

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research lacks a clear, unified framework, and the variability in experimental settings complicates systematic comparisons. While existing surveys comprehensively overview the stages and methods of data selection, they often overlook an in-depth exploration of the fine-tuning phase. In this paper, we conduct a focused review of recent data selection techniques for fine-tuning LLMs, analyzing a dozen key studies. We introduce a novel three-stage scheme - comprising feature extraction, criteria design, and selector evaluation - to systematically categorize and evaluate these methods. Additionally, we propose a unified comparison approach that incorporates…
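As a rough illustration of the three-stage scheme named in the abstract (feature extraction, criteria design, selector evaluation), the stages can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not a method from the paper: the word-count features, the "longer response scores higher" criterion, the top-fraction rule, and the `train_and_score` callable are all hypothetical placeholders chosen only to make the stages concrete.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, str]  # assumed record shape: {"instruction": ..., "response": ...}


def extract_features(sample: Sample) -> Dict[str, float]:
    """Stage 1, feature extraction: map a sample to numeric features.

    Toy features only (word counts); real selectors may use other signals.
    """
    return {
        "instruction_words": float(len(sample["instruction"].split())),
        "response_words": float(len(sample["response"].split())),
    }


def quality_score(features: Dict[str, float]) -> float:
    """Stage 2, criteria design: turn features into one score.

    Hypothetical criterion: longer responses score higher.
    """
    return features["response_words"]


def select_top_fraction(
    data: Sequence[Sample],
    fraction: float,
    score_fn: Callable[[Dict[str, float]], float] = quality_score,
) -> List[Sample]:
    """Stage 2, continued: keep the highest-scoring fraction of the data."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not data:
        return []
    k = max(1, round(len(data) * fraction))
    ranked = sorted(data, key=lambda s: score_fn(extract_features(s)), reverse=True)
    return ranked[:k]


def evaluate_selector(
    data: Sequence[Sample],
    subset: Sequence[Sample],
    train_and_score: Callable[[Sequence[Sample]], float],
) -> Dict[str, float]:
    """Stage 3, selector evaluation: compare subset training with the full-data baseline.

    `train_and_score` is a caller-supplied stand-in for "fine-tune a model on
    this data and return a benchmark score"; it is an assumption of this sketch.
    """
    full_score = train_and_score(data)
    subset_score = train_and_score(subset)
    return {
        "selected_fraction": len(subset) / len(data),
        "full_score": full_score,
        "subset_score": subset_score,
        "gain": subset_score - full_score,
    }
```

In practice `train_and_score` would wrap an actual fine-tuning and benchmarking run; keeping it as a parameter leaves the selection logic independent of any particular model or benchmark.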

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling