Revealing the Learning Dynamics of Long-Context Continual Pre-training

Yupu Liang; Shuang Chen; Guanwei Zhang; Shaolei Wang; Suncong Zheng

arXiv:2604.02650·cs.CL·April 6, 2026

TL;DR

This paper systematically investigates the learning dynamics of large-scale long-context continual pre-training (LCCP) on an 80B-parameter model, covering data scaling, saturation detection, and mechanistic monitoring for industrial LLMs.

Contribution

The paper introduces a hierarchical framework for analyzing LCCP dynamics on industrial-grade models. It highlights the importance of massive data, exposes deceptive saturation, and proposes mechanistic monitoring tools.

Findings

01

150B tokens needed for saturation in industrial LLMs

02

PPL-based analysis reveals continuous improvements beyond early saturation

03

Retrieval head attention scores reliably track LCCP progress
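
Finding 03 refers to attention scores of retrieval heads, but this listing does not say how the paper computes them. As a rough illustration only, the sketch below assumes one attention row per query token (weights over context positions for a single head) and a known needle span, and sums the attention mass that falls on the needle. The function names and the toy values are hypothetical, not taken from the paper.

```python
def needle_attention_mass(attn_row, needle_start, needle_end):
    """Sum of one head's attention weights on positions [needle_start, needle_end)."""
    if not 0 <= needle_start < needle_end <= len(attn_row):
        raise ValueError("needle span must lie inside the context")
    return sum(attn_row[needle_start:needle_end])


def mean_needle_mass(attn_rows, needle_start, needle_end):
    """Average needle attention mass over several query tokens (assumed metric)."""
    if not attn_rows:
        raise ValueError("need at least one attention row")
    masses = [needle_attention_mass(r, needle_start, needle_end) for r in attn_rows]
    return sum(masses) / len(masses)


# Toy attention rows for one head at two hypothetical checkpoints.
# Each row sums to 1; the needle occupies positions 1 and 2.
early_rows = [[0.4, 0.1, 0.1, 0.4], [0.5, 0.1, 0.1, 0.3]]
later_rows = [[0.1, 0.4, 0.4, 0.1], [0.1, 0.3, 0.5, 0.1]]

print(mean_needle_mass(early_rows, 1, 3))
print(mean_needle_mass(later_rows, 1, 3))
```

Under these assumptions, a head whose needle mass rises across checkpoints would be behaving more like a retrieval head; the paper's actual scoring rule may differ.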

Abstract

Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic…
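
The probabilistic level of the framework is described in terms of perplexity. For readers who want the quantity spelled out, here is a minimal sketch of the standard definition: the exponential of the mean negative log-likelihood per token. The checkpoint names and log-probability values are made up for illustration, and nothing here reflects the paper's evaluation data or pipeline.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities for the same held-out text at two checkpoints.
early_ckpt = [math.log(0.25)] * 8
later_ckpt = [math.log(0.5)] * 8

print(perplexity(early_ckpt), perplexity(later_ckpt))
```

Comparing this value on a fixed held-out long-context set across checkpoints is one simple way to watch for continued improvement after a benchmark score has flattened.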
