Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, Yonghua Lin

TL;DR
Infinity-Instruct is a large, high-quality instruction dataset that improves the performance of open-source LLMs on both foundational and chat tasks. It is built with a two-phase pipeline: data selection for foundational instructions, then data synthesis for chat instructions.
Contribution
The paper introduces Infinity-Instruct, an instruction dataset built with a two-phase pipeline (foundational instruction selection, then chat instruction synthesis) that improves LLM fine-tuning and narrows the gap with proprietary models.
Findings
Models fine-tuned on Infinity-Instruct outperform counterparts on instruction-following benchmarks.
InfInstruct-LLaMA3.1-70B surpasses GPT-4-0314 by 8.6% on instruction-following tasks.
The dataset improves both foundational and chat capabilities of LLMs.
Abstract
Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA,…
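The two phases described in the abstract can be pictured with a minimal sketch. It is an illustration only, not the authors' implementation: the function names (`select_top_fraction`, `synthesize_chat`) and the toy heuristics for scoring, evolution, and diagnostic filtering are hypothetical stand-ins, since the abstract does not specify the actual selection or synthesis methods.

```python
from typing import Callable, List


def select_top_fraction(
    samples: List[str], score: Callable[[str], float], keep_fraction: float
) -> List[str]:
    """Phase 1 stand-in: keep the highest-scoring fraction of a raw pool."""
    ranked = sorted(samples, key=score, reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]


def synthesize_chat(
    seeds: List[str],
    evolve: Callable[[str], str],
    passes_diagnostic: Callable[[str], bool],
) -> List[str]:
    """Phase 2 stand-in: evolve each seed, then drop variants that fail a check."""
    evolved = [evolve(seed) for seed in seeds]
    return [item for item in evolved if passes_diagnostic(item)]


# Hypothetical toy heuristics, chosen only so the example runs end to end.
def toy_score(text: str) -> float:
    # Count distinct lowercase tokens as a crude proxy for quality.
    return len(set(text.lower().split()))


def toy_evolve(text: str) -> str:
    # Make the instruction more demanding by appending a constraint.
    return text + " Explain your reasoning step by step."


def toy_diagnostic(text: str) -> bool:
    # Reject evolved instructions that are still too short.
    return len(text.split()) >= 10


pool = [
    "Sort a list.",
    "Write a Python function that merges two sorted lists.",
    "hi",
    "Prove that the sum of two even integers is even.",
]

foundational = select_top_fraction(pool, toy_score, keep_fraction=0.75)
chat = synthesize_chat(foundational, toy_evolve, toy_diagnostic)

print(len(foundational), "selected;", len(chat), "chat instructions kept")
```

For scale, the figures in the abstract imply that Phase 1 retains fewer than 7.4% of the candidate samples (7.4M out of over 100M), which is far more selective than the 75% used in this toy run.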
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
