Small Language Model as Data Prospector for Large Language Model

Shiwen Ni; Haihong Wu; Di Yang; Qiang Qu; Hamid Alinejad-Rokny; Min Yang

arXiv:2412.09990·cs.CL·December 16, 2024

TL;DR

This paper introduces SuperNUGGETS, an efficient data filtering method using a small language model to select high-quality instruction data for fine-tuning large language models, achieving comparable performance with much lower resource use.

Contribution

SuperNUGGETS improves data selection efficiency for LLM fine-tuning by replacing a large model with a small model, reducing resource consumption while maintaining performance.

Findings

01 Performance decreases by only 1-2% compared to NUGGETS.

02 Data selection efficiency increases by a factor of 58.

03 Lower resource consumption gives SuperNUGGETS higher practical utility than NUGGETS.

Abstract

The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Previously, Li et al. (2023) proposed NUGGETS, which selects high-quality data from a large dataset by identifying the individual instruction examples that significantly improve performance on different tasks after being learnt as one-shot instances. In this work, we propose SuperNUGGETS, an improved variant of NUGGETS optimised for efficiency and performance. SuperNUGGETS uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances and refines the predefined set of tests. The experimental results show that the performance of SuperNUGGETS only decreases by 1-2% compared to NUGGETS, but the efficiency can be increased by a…
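The one-shot selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `golden_score` and `select_top`, the win-rate aggregation, and the prompt layout are assumptions made for illustration. The `score_fn` argument is a hypothetical stand-in for whatever model rates how well an answer fits a prompt (in SuperNUGGETS this role would be played by a small language model); `toy_score` is a word-overlap placeholder so the example runs without any model.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]              # (instruction, answer)
ScoreFn = Callable[[str, str], float]  # hypothetical scorer: higher means a better fit


def golden_score(candidate: Example, tasks: Sequence[Example], score_fn: ScoreFn) -> float:
    """Fraction of tasks whose score improves when `candidate` is used as a one-shot demo."""
    if not tasks:
        raise ValueError("tasks must be non-empty")
    cand_instruction, cand_answer = candidate
    demo = f"{cand_instruction}\n{cand_answer}\n\n"  # assumed prompt layout
    wins = 0
    for task_instruction, task_answer in tasks:
        zero_shot = score_fn(task_instruction, task_answer)
        one_shot = score_fn(demo + task_instruction, task_answer)
        if one_shot > zero_shot:
            wins += 1
    return wins / len(tasks)


def select_top(candidates: Sequence[Example], tasks: Sequence[Example],
               score_fn: ScoreFn, keep: int) -> List[Example]:
    """Keep the `keep` candidates with the highest score; ties keep input order."""
    scored = [(golden_score(c, tasks, score_fn), i) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidates[i] for _, i in scored[:keep]]


def toy_score(prompt: str, answer: str) -> float:
    """Placeholder scorer (NOT a language model): counts answer words present in the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(word in prompt_words for word in answer.lower().split()))


tasks = [("Translate cat to French", "chat"), ("Capital of France", "paris")]
candidates = [
    ("Name a French city", "paris is a french city"),
    ("Say hello", "hello"),
    ("List two words", "chat paris"),
]
selected = select_top(candidates, tasks, toy_score, keep=2)
```

Under this sketch, swapping `toy_score` for a scorer backed by a smaller model changes only the cost of each call, which is the kind of saving the abstract attributes to using an SLM in place of an LLM.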

Taxonomy

Methods: Sparse Evolutionary Training