Hyperparameters in Continual Learning: A Reality Check

Sungmin Cha; Kyunghyun Cho

arXiv:2403.09066·cs.LG·October 30, 2025·1 cites

Open Access·3 Reviews

TL;DR

This paper critiques current hyperparameter tuning practices in continual learning, proposing a new evaluation protocol that better assesses the true generalization and capacity of algorithms across unseen scenarios.

Contribution

It introduces the Generalizable Two-phase Evaluation Protocol (GTEP) to improve the assessment of continual learning algorithms' generalizability and real-world applicability.

Findings

01. Conventional evaluation overestimates the performance of most state-of-the-art algorithms.
02. Many algorithms fail to replicate their reported results when evaluated with GTEP.
03. GTEP reveals that results reported on existing continual learning benchmarks are significantly overestimated.

Abstract

Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominantly adopted conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths, etc.) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based…
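The difference between the conventional protocol and an evaluation on an unseen scenario can be illustrated with a small sketch. This is not the authors' implementation: `toy_cl_score`, the learning-rate grid, and the per-scenario optima are hypothetical stand-ins for training a CL algorithm on a scenario and measuring its final accuracy. The split into a tuning scenario and an evaluation scenario built from different datasets follows the reviewers' description of the two phases.

```python
from typing import Callable, Sequence, Tuple

Scenario = Callable[[float], float]  # maps a hyperparameter value to a score


def toy_cl_score(lr: float, scenario_optimum: float) -> float:
    """Hypothetical stand-in for 'train on a CL scenario, return final accuracy'.

    The score peaks at 1.0 when lr equals the scenario's (made-up) optimum
    and falls off linearly, clipped at 0.
    """
    return max(0.0, 1.0 - abs(lr - scenario_optimum) * 5.0)


def select_best(grid: Sequence[float], scenario: Scenario) -> float:
    """Return the hyperparameter with the highest score on the given scenario."""
    return max(grid, key=scenario)


def conventional_protocol(grid: Sequence[float], scenario: Scenario) -> Tuple[float, float]:
    """Tune and report on the SAME scenario (the practice the paper critiques)."""
    best = select_best(grid, scenario)
    return best, scenario(best)


def two_phase_protocol(
    grid: Sequence[float], tuning_scenario: Scenario, eval_scenario: Scenario
) -> Tuple[float, float]:
    """Tune on one scenario, then report on a different, unseen scenario."""
    best = select_best(grid, tuning_scenario)
    return best, eval_scenario(best)


GRID = [0.01, 0.05, 0.1]  # assumed learning-rate grid


def scenario_a(lr: float) -> float:
    """Assumed tuning scenario (e.g. built from dataset A)."""
    return toy_cl_score(lr, scenario_optimum=0.1)


def scenario_b(lr: float) -> float:
    """Assumed evaluation scenario (same configuration, built from dataset B)."""
    return toy_cl_score(lr, scenario_optimum=0.05)


conv_lr, conv_score = conventional_protocol(GRID, scenario_b)
two_lr, two_score = two_phase_protocol(GRID, scenario_a, scenario_b)

print(f"conventional: lr={conv_lr}, score={conv_score:.2f}")
print(f"two-phase:    lr={two_lr}, score={two_score:.2f}")
```

In this toy setup the conventional protocol reports a higher score on `scenario_b` than the two-phase protocol does, because the hyperparameter was selected on the same scenario that produces the reported number. The gap is an artifact of the made-up score function and says nothing about the size of the effect in the paper.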

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 6·Confidence 4

Strengths

The main strength of the paper is the extensive experimental evaluation conducted under a rigorous evaluation protocol, leading to an important insight—the superior performance of some recent class-incremental methods may be due to meta-overfitting to the particular evaluation set through hyperparameter optimization. Challenging the dominant, flawed approach to evaluating continual learning algorithms is a valuable contribution that will hopefully help steer the community towards a more discipli

Weaknesses

Poor presentation and structure are the main weaknesses of the paper. Figure 4 (b) is perhaps the most important result, yet it is not given a prominent place. Figure 3 and Figure 7 could easily be short tables. Figures 1 and 2 should be simplified and would work together as a side-by-side comparison. Limiting the analysis to the 10-task and 20-task scenario, respectively, would make it possible to simplify Figures 5 and 9 and make them easier to parse. BEEF should be dropped from the figures (and, arguably

Reviewer 02·Rating 3·Confidence 5

Strengths

1. This paper aims to tackle the class-incremental learning problem, which is important to the machine learning field. 2. The topic of hyperparameter robustness is interesting and has not been investigated in the CIL field. 3. The authors have done extensive experiments to investigate the performance of different methods.

Weaknesses

1. Although the authors have done extensive experiments in their new CIL setting, my major concern lies in its rationale. In typical machine learning scenarios, the training and testing data are sampled i.i.d. from the same training set. In other words, we train a model, evaluate it on the validation set, and use the best model to test on the test set (which has the same data distribution as the validation set). However, the authors advocate using different data distributions for

Reviewer 03·Rating 5·Confidence 5

Strengths

1. I appreciate the claim that the commonly used protocol of selecting hyperparameters for continual learning methods may not be optimal in applications, given that the old training samples are largely inaccessible. 2. The authors perform extensive experiments with a variety of continual learning methods under the proposed evaluation protocol.

Weaknesses

This paper is essentially based on intuitive ideas and the empirical results are not very clear. It fails to cover many critical considerations in real-world applications. 1. The authors emphasize many times that the two phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. However, this consideration cannot fully reflect the possible differences across continual learning tasks, such as imbalanced classes per task, imbalanced train

Taxonomy

Topics: Machine Learning and Algorithms · Gaussian Processes and Bayesian Inference · Anomaly Detection Techniques and Applications