What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs

Seyed Mahed Mousavi; Simone Alghisi; Giuseppe Riccardi

arXiv:2601.03858·cs.CL·January 8, 2026

Open Access

TL;DR

This paper investigates how continual pre-training of large language models affects factual knowledge and skills, revealing that loss reduction does not reliably indicate learning progress or knowledge consolidation.

Contribution

It introduces a controlled benchmark and diagnostic probes to measure knowledge dynamics during CPT, highlighting the misalignment between loss and actual learning.

Findings

01. Factual learning is unstable and non-monotonic during CPT.

02. Knowledge pathways reconfigure rapidly, causing forgetting and narrow acquisition windows.

03. Loss decreases monotonically, but learning progress and knowledge consolidation are inconsistent.

Abstract

Continual Pre-Training (CPT) is widely used for acquiring and updating factual knowledge in LLMs. This practice treats loss as a proxy for knowledge learning, while offering no grounding in how knowledge changes during training. We study CPT as a knowledge learning process rather than solely an optimization problem. We construct a controlled, distribution-matched benchmark of factual documents and interleave diagnostic probes directly into the CPT loop, enabling epoch-level measurement of knowledge acquisition dynamics and changes in Out-Of-Domain (OOD) general skills (e.g., math). We further analyze how CPT reshapes knowledge circuits during training. Across three instruction-tuned LLMs and multiple CPT strategies, optimization and learning systematically diverge: loss decreases monotonically, while factual learning is unstable and non-monotonic. Acquired facts are rarely consolidated,…
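To make the idea of epoch-level probing concrete, the following is a minimal sketch of how per-epoch probe results could be compared against the loss curve. It is not the authors' code: the function names, the representation of probe results as sets of fact identifiers, and the toy numbers are all assumptions made for illustration, and the toy values are not results from the paper.

```python
from typing import Dict, List, Set


def is_monotone_decreasing(values: List[float]) -> bool:
    """Return True if every value is <= the one before it."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


def probe_dynamics(correct_per_epoch: List[Set[str]]) -> List[Dict[str, int]]:
    """Count facts gained, forgotten, and retained between consecutive epochs.

    correct_per_epoch[i] is the (hypothetical) set of fact ids that the
    model answered correctly on the diagnostic probe after epoch i.
    """
    transitions = []
    for prev, cur in zip(correct_per_epoch, correct_per_epoch[1:]):
        transitions.append({
            "gained": len(cur - prev),
            "forgotten": len(prev - cur),
            "retained": len(prev & cur),
        })
    return transitions


def consolidated_facts(correct_per_epoch: List[Set[str]]) -> Set[str]:
    """Facts that, once first answered correctly, stay correct to the end."""
    kept = set()
    for fact in set().union(*correct_per_epoch):
        first = next(i for i, s in enumerate(correct_per_epoch) if fact in s)
        if all(fact in s for s in correct_per_epoch[first:]):
            kept.add(fact)
    return kept


# Illustrative toy inputs only (invented for this sketch).
toy_losses = [2.1, 1.6, 1.2, 0.9]
toy_correct = [{"f1"}, {"f1", "f2", "f3"}, {"f2"}, {"f2", "f4"}]

toy_accuracy_counts = [len(s) for s in toy_correct]
toy_transitions = probe_dynamics(toy_correct)
toy_consolidated = consolidated_facts(toy_correct)
```

In this toy run the loss falls at every epoch while the count of correctly answered facts rises and then drops, which is the kind of divergence the abstract describes. Tracking set differences rather than a single accuracy number is a design choice that exposes forgetting even when overall accuracy looks flat.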

Taxonomy

Topics: Memory Processes and Influences · Domain Adaptation and Few-Shot Learning · Topic Modeling