SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning

Hugo Hazard; Zafeirios Fountas; Martin A. Benfeghoul; Adnan Oomerjee; Jun Wang; Haitham Bou-Ammar

arXiv:2511.22367·cs.LG·December 1, 2025

Open Access 3 Reviews

TL;DR

This paper introduces SuRe, a surprise-driven prioritised replay method for continual learning in large language models, combining surprise-based selection with a dual-learner design to improve knowledge retention and adaptation.

Contribution

It proposes a novel surprise-based replay strategy and a dual-learner architecture with EMA merging, significantly enhancing continual learning performance in LLMs.
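The EMA merging step in the dual-learner design can be illustrated with a minimal sketch. It assumes a fast learner trained on the current task and a slow learner updated as an exponential moving average of the fast learner's weights. The function name `ema_merge`, the `decay` value, and the plain-dictionary parameter layout are hypothetical choices for illustration, not details taken from the paper.

```python
from typing import Dict, List

# Hypothetical parameter layout: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_merge(slow: Params, fast: Params, decay: float = 0.99) -> Params:
    """Return updated slow weights: decay * slow + (1 - decay) * fast.

    The decay value is an assumption; a value close to 1 makes the slow
    learner change gradually while the fast learner adapts to new data.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: [decay * s + (1.0 - decay) * f for s, f in zip(slow[name], fast[name])]
        for name in slow
    }


slow_weights = {"w": [0.0, 1.0]}
fast_weights = {"w": [1.0, 1.0]}
merged = ema_merge(slow_weights, fast_weights, decay=0.9)
```

With `decay=0.9`, the first weight moves one tenth of the way from the slow value towards the fast value, and the second stays put because both learners already agree on it.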

Findings

01

SuRe achieves state-of-the-art results in the Large Number of Tasks (LNT) setting.

02

The dual-learner design improves stability and adaptation speed.

03

The method remains effective with reduced replay frequency and small buffers.

Abstract

Continual learning, one's ability to adapt to a sequence of tasks without forgetting previously acquired knowledge, remains a major challenge in machine learning and a key gap between artificial and human intelligence. While regularisation and replay perform well in vision, they lag behind multi-task learning for large language models (LLMs), especially at scale with many tasks. We revisit replay and argue that two failure modes drive this gap: selection (what to rehearse) and integration (how to consolidate new knowledge). To address selection, we propose Surprise-prioritised Replay (SuRe), a simple, architecture-agnostic rule that ranks and stores the most surprising (high Negative Log-Likelihood) sequences. SuRe achieves state-of-the-art performance in the Large Number of Tasks (LNT) setting and delivers the best overall average across both Standard CL and LNT benchmarks. To address…
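The selection rule described in the abstract (rank sequences by negative log-likelihood and keep the most surprising ones) can be sketched as a fixed-capacity buffer. This is a minimal illustration under stated assumptions: the class name `SurpriseBuffer`, the length-normalised (mean) NLL, and the per-token log-probabilities passed in as plain lists are all hypothetical, and the paper's actual scoring and update schedule may differ.

```python
import heapq
from typing import List, Sequence, Tuple


def sequence_nll(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of one sequence.

    Averaging over tokens is an assumption made here so that long
    sequences are not favoured purely because of their length.
    """
    return -sum(token_log_probs) / len(token_log_probs)


class SurpriseBuffer:
    """Fixed-capacity replay buffer that keeps the highest-NLL sequences."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Min-heap keyed by surprise, so the least surprising item is evicted first.
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker so equal scores never compare the texts

    def add(self, text: str, token_log_probs: Sequence[float]) -> None:
        item = (sequence_nll(token_log_probs), self._counter, text)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item[0] > self._heap[0][0]:
            # The new sequence is more surprising than the current minimum.
            heapq.heapreplace(self._heap, item)

    def contents(self) -> List[str]:
        """Stored sequences, most surprising first."""
        return [text for _, _, text in sorted(self._heap, reverse=True)]


buffer = SurpriseBuffer(capacity=2)
buffer.add("easy", [-0.1, -0.2])    # mean NLL 0.15
buffer.add("hard", [-2.0, -3.0])    # mean NLL 2.5
buffer.add("medium", [-1.0, -1.0])  # mean NLL 1.0
kept = buffer.contents()            # ["hard", "medium"]
```

In practice the log-probabilities would come from the model being trained; here they are hard-coded so the ranking logic can be read in isolation.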

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper is well written; its claims are intuitive and are validated both theoretically and empirically.
- To the best of my knowledge, the paper is the first to propose NLL for sample selection. Combining this with a slow learning strategy yields significant empirical improvements, as shown in Table 1.
- SuRe is implemented using LoRA and is therefore architecture-agnostic.
- Evaluations and ablations are sufficient, and SuRe outperforms SOTA continual LLM learners on both benchmarks.

Weaknesses

- To my knowledge, there are no significant weaknesses.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper is well motivated. It is an interesting idea to select surprising samples for replay.
2. The method is straightforward.
3. The decomposition of forgetting is interesting.

Weaknesses

1. The surprise measure might not be reliable.
2. The dual-learner idea is not novel, and the implementation seems confusing.
3. The comparison is neither sufficient nor up to date.

Please see details in the Question section.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

I identified the following strengths of this paper:
- Theoretically, it provides an upper-bound on the forgetting experienced by a buffer-based replay method. I think the way the quality of the buffer is defined (D_{F_loc}(P_{1:T}, q)) is novel and might be interesting for others. The selection term and consolidation term being complementary is an important contribution. This was later backed up by experimental results.
- Experimentally, it provides evidence that: 1) Replay-based with reservoi

Weaknesses

Aside from the theoretical contribution, the rest of the methodology section has limited novelty - it appears to combine two already established ideas. Moreover, the text never makes it clear (from what I could see) how each component reduces the terms in the upper-bound.

Readability: I don’t think that the integration (consolidation) term in Eq. 3 is well explained. Reading the main text, I do not have a good idea of what $B(\psi)$ is, apart from it being a “mechanism-specific factor”. The pa

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling