Not All Documents Are What You Need for Extracting Instruction Tuning Data
Chi Zhang, Huaping Zhong, Hongtao Li, Chengliang Chai, Jiawei Hong, Yuhao Deng, Jiacheng Wang, Tian Tan, Yizhou Yan, Jiantao Qiu, Ye Yuan, Guoren Wang, Conghui He, Lei Cao

TL;DR
This paper introduces EQUAL, a scalable framework for extracting diverse, high-quality instruction tuning data from web corpora, significantly reducing costs and improving large language model performance.
Contribution
EQUAL combines clustering and multi-armed bandit strategies to efficiently identify valuable QA pairs, enhancing instruction data quality and diversity.
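The summary above names only "clustering and multi-armed bandit strategies", so the following is a minimal sketch of how such a selection loop could look, not the authors' implementation. It assumes the documents have already been grouped into clusters, uses the standard UCB1 rule as a stand-in for whatever bandit strategy EQUAL actually uses, and replaces the expensive "extract QA pairs and score them" step with a hypothetical `reward_fn`. The function name `ucb_select_clusters`, the toy cluster names, and the reward values are all illustrative assumptions.

```python
import math
import random


def ucb_select_clusters(clusters, reward_fn, budget, c=1.0, seed=0):
    """Choose documents one at a time, treating each cluster as a bandit arm.

    clusters:  dict mapping cluster id -> list of document ids
    reward_fn: callable(doc_id) -> float in [0, 1]; hypothetical stand-in for
               "how useful are the QA pairs extracted from this document"
    budget:    maximum number of documents to run extraction on
    c:         exploration weight in the UCB1 bonus (assumed, not from the paper)
    """
    rng = random.Random(seed)
    remaining = {k: list(v) for k, v in clusters.items() if v}
    for docs in remaining.values():
        rng.shuffle(docs)

    pulls = {k: 0 for k in remaining}    # documents sampled per cluster
    mean = {k: 0.0 for k in remaining}   # running mean reward per cluster
    selected = []

    for t in range(1, budget + 1):
        active = [k for k in remaining if remaining[k]]
        if not active:
            break  # every cluster is exhausted
        untried = [k for k in active if pulls[k] == 0]
        if untried:
            arm = untried[0]  # sample each cluster once before using UCB
        else:
            arm = max(
                active,
                key=lambda k: mean[k] + c * math.sqrt(2 * math.log(t) / pulls[k]),
            )
        doc = remaining[arm].pop()
        reward = reward_fn(doc)  # the costly extraction step would happen here
        pulls[arm] += 1
        mean[arm] += (reward - mean[arm]) / pulls[arm]
        selected.append((arm, doc, reward))
    return selected, pulls


# Toy setup: three clusters with fixed, made-up usefulness scores.
TOY_REWARD = {"math": 0.9, "forum": 0.5, "news": 0.1}
toy_clusters = {
    name: [f"{name}-{i}" for i in range(100)] for name in TOY_REWARD
}
selected, pulls = ucb_select_clusters(
    toy_clusters,
    reward_fn=lambda doc: TOY_REWARD[doc.split("-")[0]],
    budget=60,
)
print(pulls)
```

In this toy run most of the extraction budget goes to the cluster with the highest simulated reward, while the weaker clusters are still sampled occasionally. That is the cost-saving idea the summary describes: only a fraction of the documents ever reach the expensive extraction step.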
Findings
Reduces computational costs by 5-10x
Improves accuracy by 2.5% on LLaMA-3.1-8B and Mistral-7B
Effective across multiple downstream tasks
Abstract
Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper presents a clear research motivation, and the proposed EQUAL method achieves a favorable trade-off between cost and performance, providing valuable insights for future work. 2. The paper is well-organized, and the experiments in this paper cover three major datasets and multiple tasks, providing comprehensive evidence for the effectiveness of EQUAL. 3. The authors conducted comprehensive ablation studies on the proposed framework and provided thorough analyses of the extracted data
1. In lines 144–148, the authors emphasize that the main goal of this paper is to reduce the number of documents to be extracted in order to lower data construction costs while maintaining high model performance. However, the paper does not clearly show the performance gap between the instruction data extracted under high-cost settings and that obtained using the proposed method. This could be an important issue, as a large gap would call into question the practical significance of the proposed
1. The paper makes innovative use of the Multi-Armed Bandit framework to iteratively sample from the most promising clusters. 2. The paper's key insight is interesting: don't cluster raw documents, cluster them by the data you can get out of them. 3. The paper is easy to follow, and the authors have run many experiments, including comprehensive downstream benchmarks.
My core issues with this paper are that its impressive results seem to rest on a few critical assumptions that don't hold up to scrutiny, making me question its real-world practicality. 1. The entire framework is anchored to the Optimal Transport distance to a reference set, D_r. The paper's claim (in Appendix P) that using just 20 reference samples is nearly as effective as 1500 is not credible to me. Estimating a stable OT distance for high-dimensional embeddings from 20 samples is statistical
1. The limitations pointed out by the work make sense, and the proposed solution is tailored to fix them. In particular, the algorithm is reasonable: it poses document selection and QA pair generation as a multi-armed bandit problem. 2. The paper also proposes a contrastive learning approach to ensure that the features extracted from the documents and their extracted QA pairs are close in the representation space, while original documents and negative QA pairs are far apart. 3. The experimen
1. Lines 46-48 argue that LLMs used for synthetic data generation suffer from a lack of diversity because of their proximity to the seed examples. The proposed method may suffer from the same problem, since the algorithm is optimized to reduce the optimal transport distance between the QA pairs from the documents and the seed datasets (MATH/GSM-8K or MBPP). 2. The choice of models and evaluation datasets is somewhat old. For instance, it is quite common to show some quantitative experiments on Qwen m
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
