Large-Scale Diverse Synthesis for Mid-Training
Xuemiao Zhang, Chengying Tu, Can Ren, Rongxiang Weng, Hongfei Yan, Jingang Wang, and Xunliang Cai

TL;DR
This paper introduces BoostQA, a large-scale, diverse QA dataset synthesized for mid-training LLMs, significantly improving their performance across multiple benchmarks by enhancing domain-specific knowledge and data quality.
Contribution
The paper presents a novel diversified synthesis pipeline for creating a 100B-token QA dataset, BoostQA, tailored for mid-training to boost large language model performance.
Findings
Mid-training Llama-3 8B on BoostQA yields an average improvement of 12.74% on MMLU and CMMLU.
Mid-training with BoostQA achieves state-of-the-art results across 12 benchmarks.
Scalability tests show performance gains with increased model size and data volume.
Abstract
The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
