One-Shot Learning as Instruction Data Prospector for Large Language Models

Yunshui Li; Binyuan Hui; Xiaobo Xia; Jiaxi Yang; Min Yang; Lei Zhang; Shuzheng Si; Ling-Hao Chen; Junhao Liu; Tongliang Liu; Fei Huang; Yongbin Li

arXiv:2312.10302·cs.CL·June 4, 2024·1 cites

TL;DR

This paper introduces Nuggets, a method that uses one-shot learning to select high-quality instruction data, improving large language model tuning by filtering out noise and enhancing performance.

Contribution

Nuggets provides an efficient way to identify valuable instruction examples for tuning large language models, outperforming traditional data scaling methods.

Findings

01

Selecting high-quality instruction data significantly improves model performance.

02

Top 1% of curated data outperforms full datasets.

03

The method is effective across multiple benchmarks.

Abstract

Contemporary practices in instruction tuning often hinge on enlarging data scaling without a clear strategy for ensuring data quality, inadvertently introducing noise that may compromise model performance. To address this challenge, we introduce Nuggets, a novel and efficient methodology that leverages one-shot learning to discern and select high-quality instruction data from extensive datasets. Nuggets assesses the potential of individual instruction examples to act as effective one-shot learning instances, thereby identifying those that can significantly improve performance across diverse tasks. Nuggets utilizes a scoring system based on the impact of candidate examples on the perplexity of a diverse anchor set, facilitating the selection of the most advantageous data for instruction tuning. Through comprehensive evaluations on two benchmarks, including…
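
The scoring idea in the abstract can be illustrated with a small sketch. It is not taken from the paper or from the pldlgb/nuggets repository. It assumes, purely for illustration, that a candidate's score is the fraction of anchor examples whose perplexity drops when the candidate is prepended as a one-shot demonstration. The names perplexity, one_shot_score, and select_top are hypothetical, and the per-token log-probabilities would have to come from whatever language model is being tuned.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities of one answer."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def one_shot_score(zero_shot_ppl: Sequence[float],
                   one_shot_ppl: Sequence[float]) -> float:
    """Fraction of anchor examples whose perplexity drops with the candidate.

    zero_shot_ppl[i]: perplexity of anchor i with no demonstration.
    one_shot_ppl[i]:  perplexity of anchor i with the candidate prepended.
    This scoring rule is an assumption made for illustration.
    """
    if len(zero_shot_ppl) != len(one_shot_ppl) or not zero_shot_ppl:
        raise ValueError("anchor lists must be non-empty and the same length")
    improved = sum(1 for z, o in zip(zero_shot_ppl, one_shot_ppl) if o < z)
    return improved / len(zero_shot_ppl)


def select_top(scores: Dict[str, float], fraction: float) -> List[str]:
    """Return the ids of the highest-scoring candidates (at least one)."""
    k = max(1, math.ceil(len(scores) * fraction))
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)[:k]


# Toy numbers, invented for the example: four anchors, three candidates.
zero_shot = [10.0, 8.0, 6.0, 4.0]
scores = {
    "cand_a": one_shot_score(zero_shot, [9.0, 8.5, 5.0, 3.0]),
    "cand_b": one_shot_score(zero_shot, [11.0, 9.0, 6.5, 3.9]),
    "cand_c": one_shot_score(zero_shot, [9.5, 7.0, 6.2, 4.1]),
}
selected = select_top(scores, fraction=0.34)
```

In a real pipeline the two perplexity lists would be computed by running the model twice per anchor example, once with and once without the candidate in the prompt. The toy lists above only stand in for those measurements.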

Code & Models

Repositories

pldlgb/nuggets (PyTorch, official)

Videos

One-Shot Learning as Instruction Data Prospector for Large Language Models · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: ALIGN