What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs
Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi

TL;DR
This paper investigates how continual pre-training of large language models affects factual knowledge and skills, revealing that loss reduction does not reliably indicate learning progress or knowledge consolidation.
Contribution
It introduces a controlled benchmark and diagnostic probes to measure knowledge dynamics during CPT, highlighting the misalignment between loss and actual learning.
Findings
Factual learning is unstable and non-monotonic during CPT.
Knowledge pathways reconfigure rapidly, causing forgetting and narrow acquisition windows.
Loss decreases monotonically, but learning progress and knowledge consolidation are inconsistent.
Abstract
Continual Pre-Training (CPT) is widely used for acquiring and updating factual knowledge in LLMs. This practice treats loss as a proxy for knowledge learning, while offering no grounding in how that knowledge changes during training. We study CPT as a knowledge learning process rather than solely an optimization problem. We construct a controlled, distribution-matched benchmark of factual documents and interleave diagnostic probes directly into the CPT loop, enabling epoch-level measurement of knowledge acquisition dynamics and changes in Out-Of-Domain (OOD) general skills (e.g., math). We further analyze how CPT reshapes knowledge circuits during training. Across three instruction-tuned LLMs and multiple CPT strategies, optimization and learning systematically diverge: loss decreases monotonically while factual learning is unstable and non-monotonic. Acquired facts are rarely consolidated,…
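The epoch-level measurement described in the abstract can be illustrated with a minimal bookkeeping sketch. It is not the authors' code: the function names, the fact identifiers, and all toy numbers below are hypothetical and chosen only to show how a loss curve can fall steadily while the set of correctly answered probes churns between epochs.

```python
from typing import Dict, List, Set


def is_non_increasing(values: List[float]) -> bool:
    """True if each value is <= the one before it (e.g. a steadily falling loss)."""
    return all(b <= a for a, b in zip(values, values[1:]))


def is_non_decreasing(values: List[float]) -> bool:
    """True if each value is >= the one before it (e.g. steadily rising accuracy)."""
    return all(b >= a for a, b in zip(values, values[1:]))


def knowledge_dynamics(correct_by_epoch: List[Set[str]]) -> List[Dict[str, Set[str]]]:
    """Compare consecutive epochs of probe results.

    correct_by_epoch[i] is the set of fact IDs answered correctly after epoch i.
    Returns one dict per epoch transition with acquired, forgotten, and retained IDs.
    """
    transitions = []
    for prev, curr in zip(correct_by_epoch, correct_by_epoch[1:]):
        transitions.append({
            "acquired": curr - prev,   # newly correct at this epoch
            "forgotten": prev - curr,  # correct before, wrong now
            "retained": prev & curr,   # correct at both epochs
        })
    return transitions


# Toy values for illustration only; NOT results from the paper.
toy_loss = [2.0, 1.5, 1.2, 1.0]
toy_correct = [set(), {"f1", "f2"}, {"f2", "f3"}, {"f3"}]
toy_accuracy = [len(s) / 3 for s in toy_correct]  # 3 hypothetical probe facts

toy_transitions = knowledge_dynamics(toy_correct)
```

In this toy run the loss is non-increasing while accuracy is not non-decreasing, and the second transition shows one fact acquired and one forgotten within the same epoch. Tracking sets of fact IDs rather than a single accuracy number is what makes such churn visible.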
Taxonomy
Topics: Memory Processes and Influences · Domain Adaptation and Few-Shot Learning · Topic Modeling
