A Survey on LLM Mid-Training

Chengying Tu; Xuemiao Zhang; Rongxiang Weng; Rumei Li; Chen Zhang; Yang Bai; Hongfei Yan; Jingang Wang; Xunliang Cai

arXiv:2510.23081·cs.CL·November 5, 2025

TL;DR

This survey examines mid-training, the stage of large language model development that sits between pre-training and post-training. It describes how mid-training strengthens capabilities such as reasoning and coding, and it provides a formal definition and taxonomy for the stage.

Contribution

The survey formally defines mid-training for LLMs, analyzes optimization frameworks covering data curation, training strategies, and model architecture, and offers a taxonomy and insights to guide future research.

Findings

01. Mid-training enhances specific capabilities such as reasoning and coding.

02. Optimization strategies during mid-training significantly impact model performance.

03. A formal framework and taxonomy for mid-training stages are established.

Abstract

Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM…
