Heads collapse, features stay: Why Replay needs big buffers
Giulia Lanzillotta, Damiano Meier, Thomas Hofmann

TL;DR
This paper investigates a paradox in continual learning: neural networks retain feature representations of past tasks yet forget task-specific outputs. It shows that small replay buffers suffice to preserve features (preventing deep forgetting), whereas mitigating shallow forgetting at the classifier level requires substantially larger buffers.
Contribution
The paper formalizes deep versus shallow forgetting, extends Neural Collapse to sequential learning, and explains how buffer size affects each level of forgetting in continual learning.
Findings
Small buffers prevent deep forgetting by maintaining feature geometry.
Large buffers are necessary to mitigate shallow forgetting and preserve classifier boundaries.
Any non-zero replay fraction asymptotically guarantees that linear separability is retained, regardless of buffer size.
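The distinction drawn in these findings can be illustrated with a toy sketch. It is not the paper's experimental setup: the two-class Gaussian features, the rigid "drift" (a rotation plus a shift), and the helper names `fit_linear_head` and `accuracy` are all assumptions made for illustration. The sketch compares a head that was fitted before the drift and then frozen (a proxy for shallow forgetting) against a linear probe refitted on the drifted features (a proxy for deep forgetting).

```python
import numpy as np

def fit_linear_head(feats, labels):
    """Least-squares linear classifier with a bias term; labels are 0/1."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    targets = 2.0 * labels - 1.0  # map {0, 1} -> {-1, +1}
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return w

def accuracy(w, feats, labels):
    """Fraction of samples the linear head w classifies correctly."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    preds = (X @ w > 0).astype(int)
    return float(np.mean(preds == labels))

rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n)

# Hypothetical old-task features: two well-separated Gaussian clusters.
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
feats_before = means[labels] + 0.3 * rng.standard_normal((2 * n, 2))

# Head fitted while the old task was being trained.
old_head = fit_linear_head(feats_before, labels)
acc_before = accuracy(old_head, feats_before, labels)

# Assumed drift after learning later tasks: rotate by 90 degrees, then shift.
# A rigid motion keeps the clusters linearly separable.
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
feats_after = feats_before @ rotation.T + np.array([5.0, 0.0])

# Shallow view: the frozen old head evaluated on the drifted features.
stale_head_acc = accuracy(old_head, feats_after, labels)

# Deep view: a fresh linear probe refitted on the drifted features.
probe = fit_linear_head(feats_after, labels)
probe_acc = accuracy(probe, feats_after, labels)

print(f"old head, before drift: {acc_before:.2f}")
print(f"old head, after drift:  {stale_head_acc:.2f}")
print(f"refit probe, after drift: {probe_acc:.2f}")
```

In this toy setting the frozen head degrades to roughly chance while the refitted probe stays near-perfect, which mirrors the "heads collapse, features stay" pattern; it says nothing about how large a buffer is needed in practice.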
Abstract
A persistent paradox in continual learning (CL) is that neural networks often retain linearly separable representations of past tasks even when their output predictions fail. We formalize this distinction as the gap between deep (feature-space) and shallow (classifier-level) forgetting. We reveal a critical asymmetry in Experience Replay: while minimal buffers successfully anchor feature geometry and prevent deep forgetting, mitigating shallow forgetting typically requires substantially larger buffer capacities. To explain this, we extend the Neural Collapse framework to the sequential setting. We characterize deep forgetting as a geometric drift toward out-of-distribution subspaces and prove that any non-zero replay fraction asymptotically guarantees the retention of linear separability. Conversely, we identify that the "strong collapse" induced by small buffers leads to…
Peer Reviews
Decision: ICLR 2026 Poster
The paper is well written and clearly organized, with a pleasant and coherent flow of discussion. The topic addressed is novel and engaging. Moreover, the theoretical analysis is insightful and clearly explained.
While the theoretical discussion is sound and convincing, I have some concerns regarding the empirical analysis. First, it is unclear why the authors chose ResNet and ViT as reference models. It seems that the selected architectures could significantly influence the observed behaviors and results. If this is the case, the authors should explicitly discuss this aspect. Otherwise, a justification of why the chosen architectures do not affect the outcomes should be provided. Along the same lines,
1. This work proposes a novel framework for replay-based continual learning. The results and implications are meaningful and helpful to the community.
2. The authors provide a sufficient and comprehensive theoretical analysis for replay-based continual learning, considering three different setups and showing the effect of replay.
3. The empirical study is consistent with the theoretical findings, across both real-world and simulated datasets.
4. The work is well structured and easy to follow fo
1. In Theorems 1, 2, and 3, it seems that $\nu = 1 - \eta \lambda$ is required to be non-negative or greater than $-1$, but I do not find any explicit condition on $\nu$.
2. The explanation of why replay cannot strongly mitigate shallow forgetting is not convincing to me. The authors argue that the approximation error is the key reason, but there is no formal result to support this claim, which limits the contribution of this paper.
3. In the experimental results, the authors only vary the bu
Extends Neural Collapse analysis to continual learning and multi-head architectures—an unexplored direction. Uses asymptotic analysis and connects NC with OOD theory in a rigorous manner. Establishes a bridge between NC, CL, and OOD detection, enriching all three research domains.
While conceptually strong, it provides limited actionable guidance for improving CL performance.
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
