TL;DR
This paper introduces a continual learning method that uses Douglas-Rachford Splitting (DRS) to balance plasticity and stability without auxiliary modules.
Contribution
It reformulates continual learning as a negotiation between two decoupled objectives via DRS, instead of summing competing loss terms and managing the resulting gradient conflicts with external memory replay or parameter regularization.
Findings
Achieves better stability-plasticity balance without auxiliary modules
Provides a more principled and stable learning dynamic
Simplifies continual learning framework
Abstract
Learning from a stream of tasks usually pits plasticity against stability: acquiring new knowledge often causes catastrophic forgetting of past information. Most methods address this by summing competing loss terms, creating gradient conflicts that are managed with complex and often inefficient strategies such as external memory replay or parameter regularization. We propose a reformulation of the continual learning objective using Douglas-Rachford Splitting (DRS). This reframes the learning process not as a direct trade-off, but as a negotiation between two decoupled objectives: one promoting plasticity for new tasks and the other enforcing stability of old knowledge. By iteratively finding a consensus through their proximal operators, DRS provides a more principled and stable learning dynamic. Our approach achieves an efficient balance between stability and plasticity without the need…
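To make the "consensus through proximal operators" idea concrete, here is a minimal sketch of the generic Douglas-Rachford iteration on a toy problem. Everything specific in it is an assumption for illustration and is not taken from the paper: the plasticity term is a quadratic pulling toward a hypothetical new-task optimum `theta_new`, the stability term is a quadratic pulling toward hypothetical old parameters `theta_old`, and both proximal operators are available in closed form. The helper names `make_quadratic_prox` and `douglas_rachford` are likewise hypothetical.

```python
import numpy as np

def make_quadratic_prox(center, weight, gamma):
    """Return the prox of gamma * (weight / 2) * ||x - center||^2 (closed form)."""
    def prox(v):
        return (v + gamma * weight * center) / (1.0 + gamma * weight)
    return prox

def douglas_rachford(prox_f, prox_g, z0, n_iter=200, relax=1.0):
    """Generic Douglas-Rachford splitting for min_x f(x) + g(x).

    Each objective is touched only through its own proximal operator;
    the auxiliary variable z carries the 'negotiation' between them.
    """
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(n_iter):
        x = prox_f(z)                # step on the first objective (plasticity)
        y = prox_g(2.0 * x - z)      # step on the second objective (stability)
        z = z + relax * (y - x)      # move z toward agreement between x and y
    return prox_f(z)

# Toy setup (all values are illustrative assumptions).
theta_new = np.array([1.0, -2.0])    # hypothetical optimum of the new task
theta_old = np.array([0.0, 0.5])     # hypothetical parameters after old tasks
mu, gamma = 3.0, 0.5                 # stability weight and prox step size

prox_plasticity = make_quadratic_prox(theta_new, 1.0, gamma)
prox_stability = make_quadratic_prox(theta_old, mu, gamma)

theta = douglas_rachford(prox_plasticity, prox_stability, z0=np.zeros(2))

# For two quadratics the minimizer of the sum is a weighted average.
closed_form = (theta_new + mu * theta_old) / (1.0 + mu)
print(theta, closed_form)
```

In this convex toy case the iterate agrees with the closed-form minimizer. The reviews below note that the actual method is nonconvex and approximates the proximal steps with K-step SGD, so this sketch only shows the splitting structure, not the paper's algorithm or its guarantees.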
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
+ Introduces a conceptually novel connection between DRS and CL. I have not seen the application of operator splitting methods in continual learning before. + The alternation between plasticity (data likelihood) and stability (Rényi prior alignment) provides an interpretable framework linking optimization dynamics to approximate Bayesian inference. + The use of Rényi divergence as a stability regularizer is, I think, new, and it is theoretically motivated by avoiding the KL zero-forcing bias.
1. The theoretical analysis is incomplete. The propositions rely on convexity assumptions and exact proximal updates, while the actual algorithm is nonconvex and inexact, invalidating the stated convergence guarantees. 2. Prop-3.1 incorrectly argues that the KL divergence causes “zero forcing” in Gaussian families. It fails to formally derive the claimed advantage of the Rényi term. 3. Prop-3.2 applies Douglas–Rachford results without addressing error accumulation from the K-step SGD approximation to
- The theoretical background is solid, and the motivation of leveraging DRS for a composite objective makes sense. - Clear formulation and well-written exposition of the theory. - Large number of comprehensive experiments with a significant number of baselines. - The method is replay-free and only optimization-based
Weaknesses: 1. Some claims regarding the impact of DRS are not significantly backed up or overstated. DRS is a generic solver for any regularized objective; hence, every regularization-based CL method can be reformulated similarly. Experiments replacing the SGD optimizer with DRS for existing regularization-based methods would be appreciated to showcase the effectiveness of DRS. 2. The paper combines the optimizer DRS with other changes (Rényi divergence, Bayesian latent models, tuned hyperpara
1. The biggest advantage of this paper is that it has minimal storage overhead while maintaining high performance. The method is entirely replay-free and doesn't require additional model components, making it practical for resource-constrained settings. 2. The paper offers a fresh perspective by reframing continual learning as an optimization problem rather than an objective design problem. This aligns with recent work [1] showing that how we optimize can be as important as what we optimize. 3
1. My biggest concern is the inconsistency of the motivation. The paper's central claim that optimization strategy is the fundamental problem in CL needs stronger validation. While we acknowledge that DRS and Rényi divergence might work best together, the paper doesn't show why this combination is necessary. At minimum, testing DRS with KL divergence would help us understand whether the improvements come from the optimization strategy or the divergence choice. Though the authors focus on replay-
1. Introducing DRS into continual learning seems to be novel. 2. The method was tested across a wide range of datasets, underlining its robustness and broad applicability. 3. Demonstrates substantial improvements in minimizing forgetting and enhancing knowledge transfer, making notable progress compared to traditional approaches.
1. The paper is poorly written: * In the proof for the proposition, the authors just cited a lot of previous works, using their results. It's hard for the reviewer to verify the correctness. * The paper lists a lot of baselines without indicating what specific methods they use. It's hence hard to verify why the proposed method is superior. 2. Lack of Sequential Meta-Learning Comparison: Given that the proposed approach resembles sequential meta-learning, comparisons with similar Bayesi
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
