Scaling Towards the Information Boundary of Instruction Sets: The Infinity Instruct Subject Technical Report
Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan

TL;DR
This paper introduces a systematic framework for constructing high-quality instruction datasets that expand coverage and depth, significantly improving large models' instruction-following abilities and generalization.
Contribution
It presents a novel iterative data construction framework integrating hierarchical tagging, seed selection, synthesis, and diagnosis, leading to the creation of the Infinity Instruct Subject dataset.
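As a rough illustration of how the four stages named above could feed one another in a single round, here is a minimal Python sketch. It is not the authors' implementation: every function name, the keyword-based tagger, the rare-tag seed heuristic, and the "append a constraint" synthesis step are hypothetical stand-ins chosen only to make the loop concrete.

```python
from collections import Counter

# Toy stand-ins for the four stages; all names and rules here are hypothetical.

def tag_instruction(text):
    """Assign a two-level (domain, task) tag using a simple keyword lookup."""
    domain = "math" if any(w in text for w in ("sum", "prove", "integral")) else "general"
    task = "explain" if text.lower().startswith("explain") else "solve"
    return (domain, task)

def select_seeds(pool, k):
    """Prefer instructions whose tag is rare in the pool (a crude proxy for informativeness)."""
    counts = Counter(tag_instruction(t) for t in pool)
    return sorted(pool, key=lambda t: (counts[tag_instruction(t)], t))[:k]

def synthesize(seed):
    """Stand-in for evolutionary synthesis: deepen a seed by adding one constraint."""
    return seed + " Justify each step."

def diagnose(pool, required_tags):
    """Stand-in for deficiency diagnosis: list required tags still absent from the pool."""
    present = {tag_instruction(t) for t in pool}
    return sorted(set(required_tags) - present)

def run_round(pool, k, required_tags):
    """One iteration: select seeds, synthesize new data, then report remaining gaps."""
    seeds = select_seeds(pool, k)
    new_pool = pool + [synthesize(s) for s in seeds]
    return new_pool, diagnose(new_pool, required_tags)

if __name__ == "__main__":
    pool = ["Explain gravity.", "Compute the sum of 1..10.", "Explain tides."]
    required = [("math", "solve"), ("math", "explain")]
    pool, missing = run_round(pool, k=1, required_tags=required)
    print(len(pool), missing)
```

In this sketch the gaps returned by `diagnose` would steer the next round's seed selection; how the actual framework closes that loop is not specified in the text above.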
Findings
The dataset contains approximately 1.5 million instructions.
Models fine-tuned on this dataset show improved instruction-following performance.
The dataset offers greater coverage and depth than existing instruction datasets.
Abstract
Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both "coverage" (coverage of task types and knowledge areas) and "depth" (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical tagging system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with…
Taxonomy
Topics: Teaching and Learning Programming · Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics
