Large-Scale Diverse Synthesis for Mid-Training

Xuemiao Zhang, Chengying Tu, Can Ren, Rongxiang Weng, Hongfei Yan, Jingang Wang, and Xunliang Cai

arXiv:2508.01326 · cs.CL · August 5, 2025

TL;DR

This paper introduces BoostQA, a large-scale, diverse QA dataset synthesized for mid-training LLMs, significantly improving their performance across multiple benchmarks by enhancing domain-specific knowledge and data quality.

Contribution

The paper presents a novel diversified synthesis pipeline for creating a 100B-token QA dataset, BoostQA, tailored for mid-training to boost large language model performance.

Findings

01. BoostQA improves Llama-3 8B's performance by 12.74% on MMLU and CMMLU.

02. Mid-training with BoostQA achieves state-of-the-art results across 12 benchmarks.

03. Scalability tests show performance gains with increased model size and data volume.

Abstract

The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications