Instruction Mining: Instruction Data Selection for Tuning Large Language Models
Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun

TL;DR
This paper introduces InstructMining, a novel method for automatically selecting high-quality instruction data to enhance large language model finetuning, demonstrating state-of-the-art results on key benchmarks.
Contribution
The paper presents InstructMining, an innovative dataset selection technique utilizing natural language indicators, and reveals the double descent phenomenon in LLM finetuning.
Findings
InstructMining-7B achieves state-of-the-art performance on key benchmarks.
Double descent phenomenon observed in large language model finetuning.
BlendSearch effectively identifies optimal data subsets for finetuning.
Abstract
Large language models (LLMs) are initially pretrained for broad capabilities and then finetuned with instruction-following datasets to improve their performance in interacting with humans. Despite advances in finetuning, a standardized guideline for selecting high-quality datasets to optimize this process remains elusive. In this paper, we first propose InstructMining, an innovative method designed for automatically selecting premium instruction-following data for finetuning LLMs. Specifically, InstructMining utilizes natural language indicators as a measure of data quality, applying them to evaluate unseen datasets. During experimentation, we discover that the double descent phenomenon exists in large language model finetuning. Based on this observation, we further leverage BlendSearch to help find the best subset of the entire dataset (i.e., 2,532 out of 100,000). Experiment results…
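The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each example already has numeric indicator values, combines them with hypothetical weights into a quality score, ranks the examples, and then picks a subset size by trying a few candidates. The indicator names (`reward`, `length`), the weights, and the `evaluate` callback are all assumptions made for illustration, and the plain search over candidate sizes is a simple stand-in for BlendSearch, not BlendSearch itself.

```python
from typing import Callable, Dict, List, Sequence

# One example's indicator values, keyed by (hypothetical) indicator name.
Indicators = Dict[str, float]


def quality_score(indicators: Indicators, weights: Dict[str, float]) -> float:
    """Weighted sum of indicator values; in this sketch, higher means better."""
    return sum(weight * indicators[name] for name, weight in weights.items())


def rank_by_quality(
    examples: Sequence[Indicators], weights: Dict[str, float]
) -> List[int]:
    """Return example indices ordered from highest to lowest quality score."""
    scores = [quality_score(example, weights) for example in examples]
    return sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)


def best_subset_size(
    ranked: Sequence[int],
    candidate_sizes: Sequence[int],
    evaluate: Callable[[Sequence[int]], float],
) -> int:
    """Try each candidate size k on the top-k examples and keep the lowest loss.

    `evaluate` is a hypothetical callback that would finetune on the given
    subset and return a validation loss. An exhaustive scan is used here only
    as a stand-in for a real hyperparameter search method.
    """
    return min(candidate_sizes, key=lambda k: evaluate(ranked[:k]))
```

Because the loss need not fall steadily as more data is added (the double descent observation above), the subset size is treated here as something to search over rather than fixed in advance.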
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
