Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora

TL;DR
Instruct-SkillMix is an automated, low-cost pipeline that uses a strong LLM to extract instruction-following skills and then generate diverse SFT data from them. Vanilla SFT on this data yields strong gains on AlpacaEval 2.0, MT-Bench, and WildBench without PPO, DPO, or other RL methods.
Contribution
The paper introduces a two-stage pipeline (skill extraction, then data generation from random skill pairs) that uses a strong LLM to create high-quality instruction-tuning data at an estimated cost under $600.
Findings
Strong gains on instruction-following benchmarks (AlpacaEval 2.0, MT-Bench, WildBench) with only 4K examples.
Estimated dataset creation cost under $600.
Adding low-quality answers degrades performance.
Abstract
We introduce Instruct-SkillMix, an automated approach for creating diverse, high-quality SFT data for instruction-following. The pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core "skills" for instruction-following by directly prompting the model. This is inspired by the "LLM metacognition" of Didolkar et al. (2024); (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. The estimated cost of creating the dataset is under $600. Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction-following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base…
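The second stage described in the abstract, generating data from randomly chosen skill pairs, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the skill names, the prompt wording, and the `llm` callable are all hypothetical placeholders, and the real pipeline obtains its skill list by prompting a strong LLM in stage one.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical skill names for illustration only; in the paper the
# skill list is extracted by directly prompting a strong LLM.
EXAMPLE_SKILLS = [
    "critical_thinking",
    "data_analysis",
    "creative_writing",
    "technical_explanation",
    "negotiation_advice",
]


def sample_skill_pairs(
    skills: List[str], n: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Draw n random pairs of distinct skills (the same pair may recur)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(skills, 2)) for _ in range(n)]


def build_prompt(pair: Tuple[str, str]) -> str:
    """Assumed prompt wording; the paper's actual prompts are not shown here."""
    first, second = pair
    return (
        f"Write an instruction that requires both of these skills: "
        f"{first} and {second}. Then write a high-quality response to it."
    )


def generate_dataset(
    skills: List[str], n: int, llm: Callable[[str], str], seed: int = 0
) -> List[Dict[str, object]]:
    """Query the (caller-supplied) LLM once per sampled skill pair."""
    return [
        {"skills": pair, "raw": llm(build_prompt(pair))}
        for pair in sample_skill_pairs(skills, n, seed)
    ]
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a stub such as `lambda prompt: "..."` is enough to exercise the sampling logic.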
Peer Reviews
Decision: ICLR 2025 Poster
The paper presents a novel approach to synthetic data generation that achieves strong results with only 4K examples, suggesting an efficient path forward for instruction tuning. The empirical validation is well-designed, testing across multiple benchmarks and models while including careful ablation studies that isolate the effects of different components. The method is cost-effective, requiring only about $600 compared to traditional human annotation approaches. The authors provide some anal
The paper's most significant limitation is the performance plateau at 4K examples, with no clear explanation or analysis of learning curves as dataset size increases. This is compounded by limited investigation of whether different architectures or model sizes might hit different ceilings. The evaluation methodology relies heavily on AlpacaEval 2.0 and lacks assessment of long-form generation and multi-turn conversations. The use of both teacher and grader models from the same model family (GP
- This paper shows very strong performance on benchmarks where LLMs are used as a judge.
- The Instruct-SkillMix framework is novel and interesting. Moreover, it does not require any seed data, which is beneficial.
- The baseline methods are not fair: the main comparison is to Alpaca 52K, which is dated and known to be a low-quality dataset. I think the authors should try comparing their dataset to stronger datasets such as ShareGPT with the responses regenerated by GPT4-Turbo.
- In my opinion, section 1.1 is somewhat misleading. The authors (in lines 70-75) say it is a mystery why public instruction tuning does not match the performance of proprietary instruct models. However, these proprietary models
- It finds that directly prompting a strong LLM to identify crucial skills achieves better performance than extracting skills from existing IFT datasets.
- The performance of using merely thousands of Instruct-SkillMix data is impressive.
- The data generation pipeline is fully automated and has nearly no human intervention.
- It conducts detailed ablation studies and shows the contributions of different components.
- It reveals that even a small amount of low-quality data greatly harms the inst
- The types of queries and topics could affect the coverage of the data. It might be worth doing an ablation study on query and topic types.
Taxonomy
Topics: Experimental Learning in Engineering · Educational Technology and Assessment
Methods: Direct Preference Optimization · Entropy Regularization · Shrink and Fine-Tune · Proximal Policy Optimization
