Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Yihan Cao; Yanbin Kang; Chi Wang; Lichao Sun

arXiv:2307.06290 · cs.CL · July 30, 2024 · 2 cites

TL;DR

This paper introduces InstructMining, a novel method for automatically selecting high-quality instruction data to enhance large language model finetuning, demonstrating state-of-the-art results on key benchmarks.

Contribution

The paper presents InstructMining, a data selection technique that uses natural language indicators to measure instruction data quality, and reports that the double descent phenomenon occurs in LLM finetuning.

Findings

01 InstructMining-7B achieves state-of-the-art performance on key benchmarks.

02 Double descent phenomenon observed in large language model finetuning.

03 BlendSearch effectively identifies optimal data subsets for finetuning.

Abstract

Large language models (LLMs) are initially pretrained for broad capabilities and then finetuned with instruction-following datasets to improve their performance in interacting with humans. Despite advances in finetuning, a standardized guideline for selecting high-quality datasets to optimize this process remains elusive. In this paper, we first propose InstructMining, an innovative method designed for automatically selecting premium instruction-following data for finetuning LLMs. Specifically, InstructMining utilizes natural language indicators as a measure of data quality, applying them to evaluate unseen datasets. During experimentation, we discover that the double descent phenomenon exists in large language model finetuning. Based on this observation, we further leverage BlendSearch to help find the best subset of the entire dataset (i.e., 2,532 out of 100,000). Experiment results…
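
The selection idea in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the two indicator functions, the `WEIGHTS` values, the "higher score is better" convention, and the `instruction`/`output` record layout are hypothetical and are not taken from the paper. The subset-size search is a plain exhaustive scan over candidate sizes, standing in for BlendSearch, which the paper uses instead.

```python
from typing import Callable, Dict, List, Sequence

Example = Dict[str, str]  # assumed record layout: {"instruction": ..., "output": ...}


def response_length(example: Example) -> float:
    """Hypothetical indicator: number of whitespace-separated tokens in the output."""
    return float(len(example["output"].split()))


def lexical_diversity(example: Example) -> float:
    """Hypothetical indicator: ratio of unique tokens to total tokens in the output."""
    tokens = example["output"].lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical indicator set and placeholder weights (not fitted, not from the paper).
INDICATORS: Dict[str, Callable[[Example], float]] = {
    "length": response_length,
    "diversity": lexical_diversity,
}
WEIGHTS: Dict[str, float] = {"length": 0.01, "diversity": 1.0}


def quality_score(example: Example) -> float:
    """Combine indicator values linearly; higher is assumed to mean better quality."""
    return sum(WEIGHTS[name] * fn(example) for name, fn in INDICATORS.items())


def select_top_k(dataset: Sequence[Example], k: int) -> List[Example]:
    """Return the k examples with the highest quality score."""
    return sorted(dataset, key=quality_score, reverse=True)[:k]


def search_subset_size(
    dataset: Sequence[Example],
    candidate_sizes: Sequence[int],
    evaluate_loss: Callable[[List[Example]], float],
) -> int:
    """Exhaustively try each candidate size and keep the one with the lowest loss.

    evaluate_loss is a caller-supplied stand-in for "finetune on this subset and
    measure evaluation loss". This scan replaces BlendSearch for simplicity.
    """
    return min(candidate_sizes, key=lambda k: evaluate_loss(select_top_k(dataset, k)))
```

Because finetuning performance is reported to be non-monotonic in the amount of data (double descent), the best subset size cannot be assumed to be the largest one, which is why the size is searched rather than fixed in advance.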

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications