Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

Istabrak Abbes; Gopeshh Subbaraj; Matthew Riemer; Nizar Islah; Benjamin Therien; Tsuguchika Tabaru; Hiroaki Kingetsu; Sarath Chandar; Irina Rish

arXiv:2508.01908 · cs.LG · August 5, 2025

TL;DR

This paper studies continual pre-training of large language models and finds that experience replay and gradient alignment make learning more stable and reduce forgetting across multiple languages and model scales. It also describes efficient implementations and offers guidance on allocating resources between replay and model size.

Contribution

The paper applies gradient alignment techniques to LLM pre-training for the first time and proposes an efficient meta-experience replay method to improve continual learning.
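As a rough illustration of how a gradient alignment update can work, the sketch below shows a Reptile-style meta-update, the mechanism used in the original meta-experience replay algorithm. This page does not spell out the paper's efficient variant, so treat this as an assumption about the general idea rather than the authors' method. The names `sgd_steps`, `reptile_meta_update`, `quadratic_grad`, `lr`, and `meta_lr` are hypothetical, and the quadratic loss is a toy stand-in for a language-model loss.

```python
def quadratic_grad(theta, target):
    # Gradient of the toy loss 0.5 * ||theta - target||^2 (stand-in for an LM loss).
    return [t - y for t, y in zip(theta, target)]


def sgd_steps(theta, batches, grad_fn, lr):
    # Take one plain SGD step per batch, in order, starting from theta.
    for batch in batches:
        grad = grad_fn(theta, batch)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def reptile_meta_update(theta, batches, grad_fn, lr, meta_lr):
    # Inner loop: sequential SGD over several batches (new and replayed data).
    theta_k = sgd_steps(list(theta), batches, grad_fn, lr)
    # Outer step: move only part of the way toward the inner-loop result.
    # In Reptile, this interpolation implicitly favors parameter updates
    # whose gradients agree across the batches seen in the inner loop.
    return [t0 + meta_lr * (tk - t0) for t0, tk in zip(theta, theta_k)]


if __name__ == "__main__":
    theta = [0.0, 0.0]
    batches = [[1.0, 1.0], [1.0, -1.0]]  # hypothetical "new" and "replayed" targets
    print(reptile_meta_update(theta, batches, quadratic_grad, lr=0.5, meta_lr=0.5))
```

Setting `meta_lr=1.0` recovers ordinary sequential SGD, so the interpolation coefficient is the only extra hyperparameter in this sketch.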

Findings

01. Experience replay and gradient alignment both make learning more stable and reduce forgetting.

02. Using a small replay rate is a more resource-efficient choice than increasing model size.

03. Gradient alignment techniques are effective in large-scale LLM pre-training.

Abstract

Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning…
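To make the experience replay idea concrete, the following minimal sketch builds a training batch in which a fixed fraction of examples is drawn from previously seen data. It is an illustration under stated assumptions, not the paper's implementation: `mix_replay_batch`, `replay_rate`, and the in-memory `replay_buffer` are hypothetical names, and a real pre-training pipeline would sample token sequences from stored shards rather than a Python list.

```python
import random


def mix_replay_batch(new_stream, replay_buffer, batch_size, replay_rate, rng):
    """Return a batch with about `replay_rate` of its examples taken from old data."""
    # Number of replayed examples, capped by what the buffer actually holds.
    n_replay = min(int(round(batch_size * replay_rate)), len(replay_buffer))
    n_new = batch_size - n_replay
    # Fresh examples come from the new-data stream, in order.
    new_part = [next(new_stream) for _ in range(n_new)]
    # Replayed examples are sampled without replacement from the buffer.
    old_part = rng.sample(replay_buffer, n_replay)
    batch = new_part + old_part
    rng.shuffle(batch)  # avoid a fixed new/old ordering inside the batch
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    old_data = [("old", i) for i in range(100)]       # hypothetical earlier-language data
    new_data = (("new", i) for i in range(1000))      # hypothetical new-language stream
    batch = mix_replay_batch(new_data, old_data, batch_size=16, replay_rate=0.25, rng=rng)
    print(sum(1 for tag, _ in batch if tag == "old"), "replayed of", len(batch))
```

With a small `replay_rate`, most of each batch still goes to new data, which is the setting the findings above describe as resource-efficient.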

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education