From Offline to Online Memory-Free and Task-Free Continual Learning via Fine-Grained Hypergradients
Nicolas Michel, Maorong Wang, Jiangpeng He, Toshihiko Yamasaki

TL;DR
This paper proposes a novel approach for online continual learning that combines lightweight prototypes with fine-grained hypergradients to re-balance gradient updates, enabling memory-free methods to perform effectively in sequential data streams.
Contribution
It introduces Fine-Grained Hypergradients for rebalancing gradients in online continual learning, improving performance of memory-free methods without requiring task boundaries or complex scheduling.
Findings
Augmenting offline methods with prototypes improves online performance.
Hypergradient reweighting addresses gradient imbalance in online learning.
The combined approach surpasses existing online continual learning baselines.
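To make the first finding concrete, here is a minimal sketch of a per-class prototype store that is updated one sample at a time. It is an illustration only: the class name `PrototypeStore`, the plain running-mean update, and the nearest-prototype prediction rule are assumptions made for this example, and the paper's own prototype update may differ (one reviewer notes that it mirrors CoPE).

```python
class PrototypeStore:
    """Hypothetical helper: one running-mean feature vector per class."""

    def __init__(self):
        self.protos = {}   # label -> mean feature vector (list of floats)
        self.counts = {}   # label -> number of samples seen for that label

    def update(self, feature, label):
        # Each sample is seen once; no task boundary or replay buffer is needed.
        if label not in self.protos:
            self.protos[label] = [float(x) for x in feature]
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        # Incremental mean: m <- m + (x - m) / n
        self.protos[label] = [m + (x - m) / n
                              for m, x in zip(self.protos[label], feature)]

    def predict(self, feature):
        # Return the label whose prototype is closest in squared Euclidean distance.
        def sq_dist(proto):
            return sum((p - x) ** 2 for p, x in zip(proto, feature))
        return min(self.protos, key=lambda label: sq_dist(self.protos[label]))


store = PrototypeStore()
stream = [([0, 0], 0), ([4, 4], 1), ([0, 2], 0), ([6, 4], 1)]
for feature, label in stream:
    store.update(feature, label)
print(store.protos, store.predict([1, 1]), store.predict([5, 5]))
```

Storage grows with the number of classes rather than the number of samples, which is why such prototypes are described as lightweight; whether that still counts as memory is exactly the point one of the reviews raises.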
Abstract
Continual Learning (CL) aims to learn from a non-stationary data stream where the underlying distribution changes over time. While recent advances have produced efficient memory-free methods in the offline CL (offCL) setting, where tasks are known in advance and data can be revisited, online CL (onCL) remains dominated by memory-based approaches. The transition from offCL to onCL is challenging, as many offline methods rely on (1) prior knowledge of task boundaries and (2) sophisticated scheduling or optimization schemes, both of which are unavailable when data arrives sequentially and can be seen only once. In this paper, we investigate the adaptation of state-of-the-art memory-free offCL methods to the online setting. We first show that augmenting these methods with lightweight prototypes significantly improves performance, albeit at the cost of increased Gradient Imbalance, resulting…
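The gradient imbalance mentioned in the abstract is what the paper's Fine-Grained Hypergradients are meant to re-balance. The sketch below is not the paper's update rule; it shows generic hypergradient descent for plain SGD (the learning-rate adaptation technique cited in the reviews, arXiv:1703.04782), applied with one learning rate per parameter group. The group granularity, the names `hypergradient_sgd`, `lr0`, and `beta`, and the toy quadratic loss are all assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hypergradient_sgd(grad_fn, params, lr0=0.01, beta=0.001, steps=200):
    """SGD with one hypergradient-adapted learning rate per parameter group.

    params:  dict mapping a group name to a list of floats
    grad_fn: maps such a dict to a dict of gradients with the same layout
    """
    params = {name: list(values) for name, values in params.items()}
    lrs = {name: lr0 for name in params}
    prev = {name: [0.0] * len(values) for name, values in params.items()}
    for _ in range(steps):
        grads = grad_fn(params)
        for name in params:
            # For plain SGD, d(loss)/d(lr) = -g_t . g_{t-1},
            # so a descent step on lr adds beta * g_t . g_{t-1}.
            lrs[name] = max(lrs[name] + beta * dot(grads[name], prev[name]), 0.0)
            params[name] = [p - lrs[name] * g
                            for p, g in zip(params[name], grads[name])]
            prev[name] = grads[name]
    return params, lrs


# Toy problem (assumption): two groups with very different gradient scales.
CURVATURE = {"steep": 1.0, "flat": 0.1}


def loss(params):
    return sum(0.5 * CURVATURE[n] * w * w for n, ws in params.items() for w in ws)


def grad(params):
    return {n: [CURVATURE[n] * w for w in ws] for n, ws in params.items()}


start = {"steep": [1.0], "flat": [1.0]}
final, lrs = hypergradient_sgd(grad, start)
print(loss(start), loss(final), lrs)
```

Each group's learning rate grows while its consecutive gradients point the same way and shrinks when they oppose each other, so groups receiving differently scaled gradients end up with different step sizes without any hand-tuned schedule.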
Peer Reviews
Decision · Submitted to ICLR 2026
1) The topic is timely and relevant, targeting the underexplored Offline→Online transition in CL with clear theoretical and practical significance. 2) The proposed P+FGH framework effectively addresses two core challenges of online CL (catastrophic forgetting and gradient imbalance) through a minimally intrusive and generalizable design. 3) Experiments are comprehensive, covering diverse datasets and learning rate settings, demonstrating the method’s robustness and transferability.
1) The online scenario remains quasi-online, relying on pre-defined task splits rather than fully stream-based settings, limiting realism. 2) The novelty of both P and FGH is moderate: the prototype update mirrors CoPE (2021), and FGH lacks formal convergence or stability analysis and clear differentiation from prior hypergradient methods. 3) Recent baselines (e.g., PROL 2025, PMLR 2025) are missing, and parameter details (γ, β₁/β₂, Si-Blurry settings) are insufficiently reported, affecting
The paper writing is clear. Experiments show the effectiveness of the proposed method under the proposed setting.
Limited novelty: neither hypergradients nor prototype-based memory is new; both have already been widely used in previous work, for adapting learning rates (https://arxiv.org/pdf/1703.04782) and for preventing forgetting (https://arxiv.org/pdf/2308.00301), respectively. Experiment setting and claims: This work claims to be online and memory-free; however, it uses cached prototypes, which are also a form of memory. Without having other methods using the same compute, memory and storage, it is not fair to c
1. The problem addressed by the paper (online, memory-free, task-free continual learning) is indeed a highly challenging and practically significant direction in the current field. 2. The combined framework proposed in this paper achieves outstanding performance in experiments, especially under the 'multi-learning-rate evaluation' paradigm designed by the authors, showcasing the robustness of the method.
1. The entire work can be viewed as an effective combination of two known techniques (prototypes and hypergradient descent), making the contribution more empirical than conceptual. The performance improvement from FGH largely stems from enhancing plasticity in the online setting; Equation (7) progressively increases the intra-task learning rate to boost plasticity, a mechanism that has been explored in prior work [1]. 2. Regarding catastrophic forgetting, the method essentially relies on protot
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and ELM · Machine Learning in Healthcare
