SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning
Hugo Hazard, Zafeirios Fountas, Martin A. Benfeghoul, Adnan Oomerjee, Jun Wang, Haitham Bou-Ammar

TL;DR
This paper introduces SuRe, a surprise-driven prioritised replay method for continual learning in large language models, combining surprise-based selection with a dual-learner design to improve knowledge retention and adaptation.
Contribution
It proposes a novel surprise-based replay strategy and a dual-learner architecture with EMA merging, significantly enhancing continual learning performance in LLMs.
Findings
SuRe achieves state-of-the-art results in the Large Number of Tasks (LNT) setting.
The dual-learner design improves stability and adaptation speed.
The method remains effective with reduced replay frequency and small buffers.
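The EMA merging named in the Contribution can be illustrated with a minimal sketch. It assumes the dual-learner design keeps a fast learner that is trained on each task and a slow learner whose parameters track an exponential moving average of the fast ones. The function name `ema_merge`, the `decay` parameter, and the plain-dictionary parameter format are hypothetical choices for illustration, not taken from the paper.

```python
from typing import Dict, List

# Hypothetical parameter format: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_merge(slow: Params, fast: Params, decay: float) -> Params:
    """Return slow-learner parameters after one EMA step towards the fast learner.

    new_slow = decay * slow + (1 - decay) * fast, applied element-wise.
    A decay close to 1 makes the slow learner change gradually.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    merged: Params = {}
    for name, slow_values in slow.items():
        fast_values = fast[name]
        if len(slow_values) != len(fast_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            decay * s + (1.0 - decay) * f
            for s, f in zip(slow_values, fast_values)
        ]
    return merged


# Usage: the slow learner moves 10% of the way towards the fast learner.
slow_params = {"w": [0.0, 1.0]}
fast_params = {"w": [1.0, 1.0]}
updated = ema_merge(slow_params, fast_params, decay=0.9)
```

With a decay near 1, the slow learner integrates new knowledge gradually while the fast learner adapts quickly, which matches the stability and adaptation-speed finding only in spirit; the paper's actual update schedule may differ.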
Abstract
Continual learning, one's ability to adapt to a sequence of tasks without forgetting previously acquired knowledge, remains a major challenge in machine learning and a key gap between artificial and human intelligence. While regularisation and replay perform well in vision, they lag behind multi-task learning for large language models (LLMs), especially at scale with many tasks. We revisit replay and argue that two failure modes drive this gap: selection (what to rehearse) and integration (how to consolidate new knowledge). To address selection, we propose Surprise-prioritised Replay (SuRe), a simple, architecture-agnostic rule that ranks and stores the most surprising (high Negative Log-Likelihood) sequences. SuRe achieves state-of-the-art performance in the Large Number of Tasks (LNT) setting and delivers the best overall average across both Standard CL and LNT benchmarks. To address…
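The selection rule described in the abstract (rank and store the highest-NLL sequences) can be sketched as a fixed-capacity buffer. This is a minimal illustration under stated assumptions: per-token log-probabilities are assumed to come from the current model, surprise is taken as the mean per-token NLL (the paper may use a different normalisation), and the names `sequence_nll` and `SurpriseBuffer` are hypothetical.

```python
import heapq
from typing import List, Tuple


def sequence_nll(token_logprobs: List[float]) -> float:
    """Mean negative log-likelihood of one sequence (assumed surprise score)."""
    if not token_logprobs:
        raise ValueError("sequence must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


class SurpriseBuffer:
    """Keeps the `capacity` most surprising sequences seen so far.

    A min-heap holds the retained items, so the least surprising retained
    sequence sits at the root and is the one evicted first.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker so equal scores never compare texts

    def add(self, text: str, token_logprobs: List[float]) -> None:
        item = (sequence_nll(token_logprobs), self._counter, text)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item[0] > self._heap[0][0]:
            # New sequence is more surprising than the weakest retained one.
            heapq.heapreplace(self._heap, item)

    def contents(self) -> List[str]:
        """Retained sequences, most surprising first."""
        return [text for _, _, text in sorted(self._heap, reverse=True)]


# Usage with made-up log-probabilities for three sequences.
buffer = SurpriseBuffer(capacity=2)
buffer.add("easy", [-0.1, -0.2])
buffer.add("hard", [-2.0, -3.0])
buffer.add("medium", [-1.0, -1.0])
```

In this toy run the buffer ends up holding "hard" and "medium", since "easy" has the lowest NLL and is evicted once the buffer is full.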
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is well-written, and its claims are intuitive and both theoretically and empirically validated.
- To the best of my knowledge, the paper is the first to propose NLL for sample selection. Combining this with a slow learning strategy empirically shows significant improvements, as shown in Table 1.
- SuRe is implemented using LoRA and is therefore architecture-agnostic.
- Evaluations and ablations are sufficient, and SuRe outperforms SOTA continual LLM learners on both benchmarks.
- To my knowledge, there are no significant weaknesses.
1. The paper is well motivated. It is an interesting idea to select surprising samples for replay.
2. The method is straightforward.
3. The decomposition of forgetting is interesting.
1. The surprise measure might not be reliable.
2. The idea of a dual learner is not novel, and the implementation seems confusing.
3. The comparison is not sufficient and not up to date.
Please see details in the Question section.
I identified the following strengths of this paper:
- Theoretically, it provides an upper-bound on the forgetting experienced by a buffer-based replay method. I think the way the quality of the buffer is defined (D_{F_loc}(P_{1:T}, q)) is novel and might be interesting for others. The selection term and consolidation term being complementary is an important contribution. This was later backed up by experimental results.
- Experimentally, it provides evidence that: 1) Replay-based with reservoi
Aside from the theoretical contribution, the rest of the methodology section has limited novelty - it appears to combine two already established ideas. Moreover, the text never makes it clear (from what I could see) how each component reduces the terms in the upper-bound. Readability: I don’t think that the integration (consolidation) term in Eq. 3 is well explained. Reading the main text, I do not have a good idea of what $B(\psi)$ is, apart from it being a “mechanism-specific factor”. The pa
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
