On Understanding of the Dynamics of Model Capacity in Continual Learning

Supriyo Chakraborty; Krishnan Raghavan

arXiv:2508.08052·cs.LG·August 15, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces CLEMC, a dynamic measure of model capacity in continual learning, revealing that neural networks' ability to learn new tasks diminishes with changing task distributions, regardless of architecture.

Contribution

The paper develops a theoretical model of effective capacity in continual learning and demonstrates its non-stationary nature across various neural network architectures.

Findings

01. Effective capacity decreases with task distribution shifts.

02. The stability-plasticity balance point is inherently non-stationary.

03. Model capacity dynamics are consistent across architectures.

Abstract

The stability-plasticity dilemma, closely related to a neural network's (NN) capacity (its ability to represent tasks), is a fundamental challenge in continual learning (CL). Within this context, we introduce CL's effective model capacity (CLEMC), which characterizes the dynamic behavior of the stability-plasticity balance point. We develop a difference equation to model the evolution of the interplay between the NN, task data, and optimization procedure. We then leverage CLEMC to demonstrate that the effective capacity, and by extension the stability-plasticity balance point, is inherently non-stationary. We show that regardless of the NN architecture or optimization method, an NN's ability to represent new tasks diminishes when incoming task distributions differ from previous ones. We conduct extensive experiments to support our theoretical findings, spanning a range of architectures, from…
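The claim that a fixed model represents a task sequence less well once the tasks drift apart can be illustrated with a toy sketch. This is not the paper's CLEMC definition or its difference equation; it is a minimal stand-in that assumes a one-parameter linear model and uses the best achievable joint loss over all tasks seen so far as a rough proxy for lost capacity. The names `best_joint_loss` and `capacity_trajectory` and the slope values are hypothetical.

```python
import numpy as np

def best_joint_loss(task_slopes, x):
    """Lowest mean squared error one shared slope can reach on all tasks seen so far.

    Assumption: task k is y = slope_k * x on the same inputs x, so the
    least-squares shared slope is simply the mean of the task slopes.
    """
    slopes = np.asarray(task_slopes, dtype=float)
    w_star = slopes.mean()
    per_task = [np.mean((w_star * x - s * x) ** 2) for s in slopes]
    return float(np.mean(per_task))

def capacity_trajectory(task_slopes, x):
    """Best joint loss after each new task arrives (toy proxy, not CLEMC)."""
    return [best_joint_loss(task_slopes[:k], x)
            for k in range(1, len(task_slopes) + 1)]

x = np.linspace(-1.0, 1.0, 101)
stationary = capacity_trajectory([1.0, 1.0, 1.0, 1.0], x)  # identical tasks
shifting = capacity_trajectory([1.0, 2.0, 3.0, 4.0], x)    # drifting tasks

print("stationary:", stationary)
print("shifting:  ", shifting)
```

With identical tasks the proxy stays at zero, while with drifting tasks it grows at every step: the same model can no longer fit all tasks at once, whatever optimizer is used to find the shared slope.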

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The authors validate their theoretical results through extensive experiments using datasets relevant to the CL paradigm.
2. The derived theoretical results apply to various NN architectures, including feed-forward, convolutional, graph neural networks, and transformer-based models.

Weaknesses

1. I went through the derivation of Lemma 1 and Theorem 1, which form the basis for Theorems 2 and 3. However, I identified some potential issues that may have led to inaccuracies in the final results.
2. I believe the paper's presentation could be improved. There are several areas where the definitions regarding "capacity" appear to conflict, and important experimental details are missing, preventing the reader from fully understanding the results.

**Disclaimer**: It is possible that I may h

Reviewer 02 · Rating 5 · Confidence 2

Strengths

S1: The topic of study, continual learning, is a very important one. The paper has a solid related works section motivating why this study is needed.
S2: The theory in the study is clear, and its conclusions are well highlighted.
S3: Experiments to confirm the predictions of the theory are done on different model architectures, which is essential to capture the generality of the theory.

Weaknesses

W1: It is unclear whether some figures support the claim. Specifically, in Figure 4a, the capacity seems to drop back to a low value even though the initial bump gets bigger with more tasks. In Figure 5, there seem to be more changes in the weights that are not explained by $\partial V^{*} / \partial x$.
W2: The FNN and GNN results could use more complex data to confirm their findings. While the authors mention that the experimental section is mostly meant for the ease of analysis, it is unclear h

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1) Theoretical justification is provided for the proposed perspective.
2) Extensive experiments are conducted across various architectures, from small feed-forward (FNN) and convolutional networks (CNN) to medium-sized graph neural networks (GNN) and large transformer-based language models (LLMs).
3) They investigate the interplay among model, task, and optimization, a trio previously examined only in pairwise combinations (either model and optimization or model and data), which I think is intere

Weaknesses

1) The paper misses a key theoretical study on continual learning that explores catastrophic forgetting and task similarity [1].
2) Could the authors discuss how CLEMC handles task order? I'm particularly interested in understanding CLEMC's sensitivity to the sequence of tasks.
3) Could the authors provide a more detailed and clear explanation of Figure 1? I find it difficult to understand.
4) In Figure 6, for the books task, the authors note that ER (Experience Replay) results in a lower cos

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Healthcare and Education · Advanced Graph Neural Networks