Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning

Shaobo Wang; Xiangqi Jin; Ziming Wang; Jize Wang; Jiajun Zhang; Kaixin Li; Zichen Wen; Zhong Li; Conghui He; Xuming Hu; Linfeng Zhang

arXiv:2505.12212 · cs.CL · June 3, 2025

TL;DR

Data Whisperer introduces a training-free, attention-based method that uses few-shot in-context learning to efficiently select optimal data subsets for fine-tuning large language models, reducing costs and improving performance.

Contribution

The paper presents a training-free data selection approach that leverages in-context learning and outperforms existing methods in both efficiency and effectiveness.

Findings

01. With only 10% of the data, achieves better performance than fine-tuning on the full GSM8K dataset (Llama-3-8B-Instruct).

02. Outperforms existing data selection methods with a 3.1-point improvement.

03. Provides a 7.4× speedup in the data selection process relative to existing methods.

Abstract

Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model,…
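The abstract does not spell out the scoring procedure, so the following is only an illustrative sketch of how a training-free, in-context-learning-based selector could work, not the authors' implementation (see the official repository for that). It assumes that random few-shot groups of demonstrations are drawn from the pool, that each group is scored by how well it helps on held-out queries, and that the score is shared among the demonstrations according to attention-like weights. The names `select_subset`, `toy_score_fn`, and the `quality` field are hypothetical, and `toy_score_fn` stands in for a real call to the model being fine-tuned.

```python
import random
from collections import defaultdict


def select_subset(pool, score_fn, n_demos=3, n_queries=2,
                  rounds=200, fraction=0.1, seed=0):
    """Rank pool items by average few-shot usefulness; keep the top fraction.

    score_fn(demos, queries) must return (score, weights), where `weights`
    has one non-negative value per demonstration (an assumed attention share).
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    indices = list(range(len(pool)))

    for _ in range(rounds):
        # Draw disjoint demonstration and query indices from the same pool.
        sampled = rng.sample(indices, n_demos + n_queries)
        demo_ids, query_ids = sampled[:n_demos], sampled[n_demos:]
        demos = [pool[i] for i in demo_ids]
        queries = [pool[i] for i in query_ids]

        score, weights = score_fn(demos, queries)

        # Credit each demonstration with its weighted share of the score.
        for i, w in zip(demo_ids, weights):
            totals[i] += score * w
            counts[i] += 1

    # Items never sampled as demonstrations default to a score of 0.0.
    average = {i: totals[i] / counts[i] for i in counts}
    k = max(1, round(len(pool) * fraction))
    ranked = sorted(indices, key=lambda i: average.get(i, 0.0), reverse=True)
    return ranked[:k]


def toy_score_fn(demos, queries):
    """Hypothetical stand-in for prompting an LLM and reading its attention.

    The toy version ignores `queries` and derives everything from a made-up
    per-item `quality` value; a real scorer would evaluate model answers.
    """
    qualities = [d["quality"] for d in demos]
    score = sum(qualities) / len(qualities)
    total = sum(qualities)
    weights = [q / total for q in qualities]
    return score, weights


# Toy pool: the first 5 items are "useful", the remaining 45 are not.
pool = [{"quality": 1.0} for _ in range(5)] + \
       [{"quality": 0.1} for _ in range(45)]
selected = select_subset(pool, toy_score_fn, rounds=2000)
print(sorted(selected))
```

In this toy setup the selector keeps 10% of the pool (five items). Swapping `toy_score_fn` for a function that actually prompts a model is where all of the real cost and design decisions would sit.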

Code & Models

Repositories

gszfwsb/Data-Whisperer
PyTorch · Official

Videos

Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning · underline

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques