Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization
Dylan Zhang, Justin Wang, Francois Charton

TL;DR
This paper demonstrates that diverse, cross-domain instruction data significantly improves large language models' ability to generalize to unseen tasks, emphasizing the importance of strategic data collection.
Contribution
It provides a rigorous analysis showing that instruction diversity across semantic domains is essential for model generalization, offering practical guidelines for dataset design.
Findings
Cross-domain diversification enhances generalization even with limited data.
Increasing data diversity improves performance more than simply increasing data quantity.
Diversification benefits both specialist and generalist models across tasks.
Abstract
Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization only emerges when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model's adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of specialist and…
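For readers unfamiliar with the Markov algorithm mentioned in the abstract, the following is a minimal sketch of the classical formalism: an ordered list of string-rewriting rules, where at each step the first applicable rule is applied to its leftmost match. It illustrates the general idea only. The paper's actual experimental setup is not described in this summary, and the names `Rule`, `markov_step`, and `run_markov` are hypothetical, chosen for illustration.

```python
from typing import List, Tuple

# A rule is (pattern, replacement, is_terminal).
# These names are illustrative assumptions, not taken from the paper.
Rule = Tuple[str, str, bool]


def markov_step(text: str, rules: List[Rule]) -> Tuple[str, bool]:
    """Apply the first applicable rule to its leftmost match.

    Returns (new_text, halted). The run halts when a terminal rule
    fires or when no rule matches.
    """
    for pattern, replacement, terminal in rules:
        idx = text.find(pattern)
        if idx != -1:
            new_text = text[:idx] + replacement + text[idx + len(pattern):]
            return new_text, terminal
    return text, True  # no rule applies: halt


def run_markov(text: str, rules: List[Rule], max_steps: int = 1000) -> str:
    """Rewrite `text` until the algorithm halts or the step limit is hit."""
    for _ in range(max_steps):
        text, halted = markov_step(text, rules)
        if halted:
            return text
    raise RuntimeError("step limit reached without halting")


# Unary addition: deleting '+' concatenates the two tallies.
ADD_RULES: List[Rule] = [("+", "", False)]

# Binary-to-unary conversion, a classic textbook rule set.
BIN_TO_UNARY: List[Rule] = [
    ("|0", "0||", False),
    ("1", "0|", False),
    ("0", "", False),
]

if __name__ == "__main__":
    print(run_markov("||+|||", ADD_RULES))
    print(run_markov("101", BIN_TO_UNARY))
```

Each rule here can be read as a tiny instruction ("replace this substring with that one"), which is one way to see why string rewriting offers a controlled testbed for instruction following: the set of rules seen in training can be varied independently of the number of examples.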
Taxonomy
Topics: Mathematics Education and Teaching Techniques
