Revealing the Learning Dynamics of Long-Context Continual Pre-training
Yupu Liang, Shuang Chen, Guanwei Zhang, Shaolei Wang, Suncong Zheng

TL;DR
This paper systematically investigates the learning dynamics of large-scale long-context continual pre-training (LCCP) on an 80B-parameter model, yielding insights into data scaling, saturation detection, and mechanistic monitoring for industrial LLMs.
Contribution
It introduces a hierarchical analysis framework for LCCP dynamics on industrial-grade models, highlighting the importance of massive data, revealing deceptive saturation issues, and proposing mechanistic monitoring tools.
Findings
Industrial-grade LLMs need 150B tokens of LCCP data to reach saturation
PPL-based analysis reveals continuous improvements beyond early saturation
Retrieval head attention scores reliably track LCCP progress
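
As a rough illustration of the perplexity-based finding above, the sketch below computes perplexity from per-token log-probabilities and checks whether the latest checkpoint still improves on the previous one. This is not the paper's procedure: the function names, the `rel_tol` threshold, and the relative-improvement rule are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def still_improving(ppl_by_checkpoint: Sequence[float], rel_tol: float = 0.005) -> bool:
    """Return True if the last checkpoint lowered PPL by more than rel_tol
    relative to the previous checkpoint.

    rel_tol is a hypothetical threshold, not a value taken from the paper.
    """
    if len(ppl_by_checkpoint) < 2:
        return True  # too little history to call saturation
    prev, last = ppl_by_checkpoint[-2], ppl_by_checkpoint[-1]
    return (prev - last) / prev > rel_tol
```

A check of this kind can be run on held-out long-context text at each checkpoint, alongside downstream benchmarks, so that a flat benchmark score is not mistaken for convergence while perplexity is still falling.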
Abstract
Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic…
