Forget Forgetting: Continual Learning in a World of Abundant Memory
Dongkyu Cho, Taesup Moon, Rumi Chunara, Kyunghyun Cho, Sungmin Cha

TL;DR
This paper explores continual learning in scenarios where memory is abundant but retraining is costly, shifting focus from stability to plasticity, and proposes a lightweight method that outperforms existing approaches with lower computational costs.
Contribution
It introduces Weight Space Consolidation, a lightweight method that combines rank-based parameter resets with weight averaging to improve continual learning in settings where memory is plentiful but GPU time is limited.
Findings
With abundant memory, simple replay baselines outperform state-of-the-art methods at a fraction of the GPU cost.
Weight Space Consolidation improves the balance between plasticity and stability.
The approach is validated on image classifiers and large language models.
Abstract
Continual learning (CL) has traditionally focused on minimizing exemplar memory, a constraint often misaligned with modern systems where GPU time, not storage, is the primary bottleneck. This paper challenges this paradigm by investigating a more realistic regime: one where memory is abundant enough to mitigate forgetting, but full retraining from scratch remains prohibitively expensive. In this practical "middle ground", we find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones. Conversely, improved stability allows simple replay baselines to outperform the state-of-the-art methods at a fraction of the GPU cost. To address this newly surfaced trade-off, we propose Weight Space Consolidation, a lightweight method that combines (1) rank-based parameter resets to restore plasticity with (2) weight…
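The two ingredients named in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the function names `rank_based_reset` and `running_average`, the per-parameter importance `scores` (the abstract does not say what is ranked), the choice to reset low-ranked parameters back to their previous-task values, and the use of a simple running mean as the weight average.

```python
# Hypothetical sketch of "rank-based resets + weight averaging" on flat
# lists of floats. All names and the scoring rule are assumptions.

def rank_based_reset(current, previous, scores, reset_fraction):
    """Reset the lowest-scoring fraction of parameters to their previous values.

    current, previous: parameter values after / before training on the new task.
    scores: assumed per-parameter importance for the new task (higher = keep).
    reset_fraction: share of parameters to reset, between 0 and 1.
    """
    n_reset = int(len(current) * reset_fraction)
    # Indices ordered from least to most important.
    order = sorted(range(len(current)), key=lambda i: scores[i])
    reset_idx = set(order[:n_reset])
    return [previous[i] if i in reset_idx else current[i]
            for i in range(len(current))]


def running_average(avg, current, n_averaged):
    """Fold one more weight snapshot into a running mean of n_averaged snapshots."""
    return [(a * n_averaged + c) / (n_averaged + 1)
            for a, c in zip(avg, current)]


if __name__ == "__main__":
    previous = [0.0, 0.0, 0.0, 0.0]
    current = [1.0, 2.0, 3.0, 4.0]
    scores = [0.1, 0.9, 0.2, 0.8]
    # The two lowest-scoring parameters (indices 0 and 2) are reset.
    print(rank_based_reset(current, previous, scores, 0.5))  # [0.0, 2.0, 0.0, 4.0]
    print(running_average([1.0, 1.0], [3.0, 5.0], 1))        # [2.0, 3.0]
```

In this toy version the reset is intended to stand in for the plasticity-restoring step and the running mean for the stability-restoring step; how the paper actually scores parameters and schedules the averaging is not stated in the text above.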
Peer Reviews
Decision: ICLR 2026 Poster
Pros:
1. The overall point that GPU compute is a tighter constraint than storage is good. Exploring such middle-ground strategies makes sense to me. The core idea of resetting certain parameters for plasticity is also interesting.
2. In the high-sample regime, the proposed method almost always outperforms existing baselines by 2-3 points.
3. The paper has nice ablation experiments, such as comparing replay with reset (Table 3) and reset strategies (Table 5). The plasticity loss experiment (F
Cons:
1. My main concern is with the framing of the paper for continual learning and its potential application. The point about GPU cost dominating storage cost is fine, but the main problem is that the data itself might not be available in the first place. So storage was never the real issue; access to data is. For a practical example, assume we have llama-3.2b instruct model but we don't know what data was used in the base to instruct training. Here, the authors are essentially assumin
+ The paper investigates a new paradigm/scenario that challenges the previous assumptions. Such new thinking is always valuable.
+ The mathematical formulations are rigorous and I discovered zero mistakes there.
+ The new paradigm is not just an "assumption," they have evidence (Section 3) to empirically demonstrate that the new assumption is valid.
- While the mathematical formulation is rigorous, the theoretical foundation on why this new approach would work is lacking.
- The proposed approach reads like A (rank-based parameter resets) + B (weight averaging). A + B isn't always necessarily bad, but both A and B are the results of prior work, so what's the technical contribution here with this approach?
- The paper is well motivated and written. The authors raise the challenge that, in current systems, memory is not the most important constraint, as GPU time is more costly. Presenting references and numbers, the paper encourages researchers not to focus on minimising memory size.
- However, there are scenarios where having a small memory is required, such as when resource constraints are imposed or when the privacy of previously learned data is an issue. These are not mentioned or explained
- A common approach to increasing the plasticity of memory-based methods is to concatenate buffer samples with current-task samples at the batch level. This differs from what is shown in the paper, where the full batch is sampled from the union of the current data and the buffer. As shown in the paper, the latter approach suffers from plasticity loss because it is less likely to sample data from the current task; by concatenating at the batch level, however, there is no plasticity problem.
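The two batch-construction strategies contrasted in the last review comment can be sketched as follows. This is an illustrative toy under stated assumptions, not code from the paper or the review: the function names `pooled_batch`, `concatenated_batch`, and `current_fraction`, the source tags, and the dataset sizes are all hypothetical.

```python
import random

# Hypothetical sketch: two ways to mix current-task data with a replay buffer.

def pooled_batch(current, buffer, batch_size, rng):
    """Draw the whole batch from one merged pool of current and buffer data.

    With a large buffer, current-task samples are rare in each batch.
    """
    pool = [("current", x) for x in current] + [("buffer", x) for x in buffer]
    return rng.sample(pool, batch_size)


def concatenated_batch(current, buffer, n_current, n_buffer, rng):
    """Draw a fixed-size sub-batch from each source and concatenate them.

    The share of current-task samples is fixed regardless of buffer size.
    """
    new_part = [("current", x) for x in rng.sample(current, n_current)]
    old_part = [("buffer", x) for x in rng.sample(buffer, n_buffer)]
    return new_part + old_part


def current_fraction(batch):
    """Fraction of a batch that comes from the current task."""
    return sum(1 for source, _ in batch if source == "current") / len(batch)


if __name__ == "__main__":
    rng = random.Random(0)
    current = list(range(100))        # 100 current-task samples (assumed size)
    buffer = list(range(100, 1100))   # 1000 buffered samples (assumed size)
    print(current_fraction(pooled_batch(current, buffer, 32, rng)))
    print(current_fraction(concatenated_batch(current, buffer, 16, 16, rng)))
```

In the pooled version the expected share of current-task samples is 100/1100 with these assumed sizes, while the concatenated version fixes it at one half by construction, which is the distinction the reviewer draws.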
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Machine Learning and ELM
