Aligning Instruction Tuning with Pre-training
Yiming Liang, Tianyu Zheng, Xinrun Du, Ge Zhang, Jiaheng Liu, Xingwei Qu, Wenqiang Zu, Xingrun Xing, Chujie Zheng, Lei Ma, Guoyin Wang, Zhaoxiang Zhang, Wenhao Huang, Xiang Yue, Jiajun Zhang

TL;DR
This paper introduces AITP, a method that improves instruction tuning of large language models by aligning instruction-tuning datasets with the distribution of the pre-training data, leading to better generalization and consistent gains across eight benchmarks.
Contribution
The paper presents AITP, a novel approach that identifies and rewrites underrepresented pre-training data into instruction-response pairs to enhance dataset diversity and model performance.
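The selection step described above can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not say how coverage is measured, so the 2-D embedding points, the grid-based density estimate, and the names `cell_density`, `select_underrepresented`, `cell_size`, and `min_gap` are all assumptions made for illustration. The idea shown is only that pre-training samples are picked from regions where the pre-training data is dense but the instruction-tuning data is sparse; those samples would then be rewritten into instruction-response pairs.

```python
from collections import Counter


def cell_of(point, cell_size):
    """Map a 2-D point to the integer grid cell that contains it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def cell_density(points, cell_size):
    """Return the fraction of points that fall in each occupied grid cell."""
    counts = Counter(cell_of(p, cell_size) for p in points)
    total = len(points)
    return {cell: n / total for cell, n in counts.items()}


def select_underrepresented(pretrain_points, instruct_points,
                            cell_size=1.0, min_gap=0.1):
    """Pick pre-training samples from regions the instruction data covers poorly.

    Hypothetical criterion: a cell counts as a coverage gap when its
    pre-training density exceeds its instruction density by more than min_gap.
    Returns the indices of pre-training points that lie in such cells.
    """
    pre = cell_density(pretrain_points, cell_size)
    ins = cell_density(instruct_points, cell_size)
    gap_cells = {c for c, d in pre.items() if d - ins.get(c, 0.0) > min_gap}
    return [i for i, p in enumerate(pretrain_points)
            if cell_of(p, cell_size) in gap_cells]


# Toy data: the instruction set only covers the region near the origin.
pretrain = [(0.2, 0.3), (0.5, 0.5), (2.1, 2.2), (2.4, 2.8), (2.9, 2.1)]
instruct = [(0.1, 0.1), (0.4, 0.6), (0.7, 0.2), (0.9, 0.9)]

print(select_underrepresented(pretrain, instruct))  # prints [2, 3, 4]
```

In this toy example the three pre-training points around (2, 2) are selected because no instruction sample falls in that cell. A real pipeline would presumably use learned text embeddings and a smoother density estimate, which is beyond what this summary specifies.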
Findings
Consistent performance improvements across three fully open LLMs and eight benchmarks.
Adaptive data selection and controlled rewriting significantly benefit model generalization.
Aligning instruction tuning datasets with pre-training data enhances LLM capabilities.
Abstract
Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose Aligning Instruction Tuning with Pre-training (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the…
Taxonomy
Topics: Education and Technology Integration
