Hyperparameters in Continual Learning: A Reality Check
Sungmin Cha, Kyunghyun Cho

TL;DR
This paper critiques current hyperparameter tuning practices in continual learning, proposing a new evaluation protocol that better assesses the true generalization and capacity of algorithms across unseen scenarios.
Contribution
It introduces the Generalizable Two-phase Evaluation Protocol (GTEP) to improve the assessment of continual learning algorithms' generalizability and real-world applicability.
Findings
The conventional evaluation protocol overestimates the performance of most state-of-the-art algorithms.
Many algorithms fail to replicate their reported results when evaluated with GTEP.
The new protocol reveals significant overestimations in existing continual learning benchmarks.
Abstract
Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominantly adopted conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths, etc.) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based…
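The contrast between the two kinds of protocol can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation. The function names, the hyperparameter grid, the scenario names `dataset_A` and `dataset_B`, and the scoring function are all hypothetical. The two-phase routine assumes only what this page states: hyperparameters are chosen in one scenario and then assessed, unchanged, in an unseen one.

```python
from typing import Callable, Iterable, Tuple

# Illustrative hyperparameter pair: (learning_rate, regularization_strength).
HParams = Tuple[float, float]
# A scoring function stands in for "train a CL algorithm on a scenario with
# these hyperparameters and return its final accuracy" (hypothetical).
ScoreFn = Callable[[str, HParams], float]


def select_best(score: ScoreFn, scenario: str, grid: Iterable[HParams]) -> HParams:
    """Return the grid point with the highest score on the given scenario."""
    return max(grid, key=lambda hp: score(scenario, hp))


def conventional_eval(score: ScoreFn, scenario: str, grid: Iterable[HParams]) -> float:
    """Tune and report on the SAME scenario (the conventional protocol)."""
    best = select_best(score, scenario, list(grid))
    return score(scenario, best)


def two_phase_eval(
    score: ScoreFn, tuning_scenario: str, eval_scenario: str, grid: Iterable[HParams]
) -> float:
    """Tune on one scenario, then report on an unseen scenario with fixed hyperparameters."""
    best = select_best(score, tuning_scenario, list(grid))
    return score(eval_scenario, best)


def toy_score(scenario: str, hp: HParams) -> float:
    """Toy stand-in: each scenario has its own optimum; the score drops with distance from it."""
    optimum = {"dataset_A": (0.1, 1.0), "dataset_B": (0.01, 10.0)}[scenario]
    lr, reg = hp
    return 1.0 - abs(lr - optimum[0]) - 0.01 * abs(reg - optimum[1])


GRID = [(0.1, 1.0), (0.01, 10.0), (0.05, 5.0)]

if __name__ == "__main__":
    same = conventional_eval(toy_score, "dataset_B", GRID)
    unseen = two_phase_eval(toy_score, "dataset_A", "dataset_B", GRID)
    print(f"conventional: {same:.2f}, two-phase: {unseen:.2f}")
```

In this toy setting the conventional number is never lower than the two-phase number, because the conventional protocol picks the hyperparameters that maximize the very score it reports. The gap between the two is the kind of overestimation the abstract describes.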
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The main strength of the paper is the extensive experimental evaluation conducted under a rigorous evaluation protocol, leading to an important insight—the superior performance of some recent class-incremental methods may be due to meta-overfitting to the particular evaluation set through hyperparameter optimization. Challenging the dominant, flawed approach to evaluating continual learning algorithms is a valuable contribution that will hopefully help steer the community towards a more discipli
Poor presentation and structure are the main weaknesses of the paper. Figure 4 (b) is perhaps the most important result, yet it is not given a prominent place. Figure 3 and Figure 7 could easily be short tables. Figures 1 and 2 should be simplified and would work together as a side-by-side comparison. Limiting the analysis to the 10-task and 20-task scenarios, respectively, would make it possible to simplify Figures 5 and 9 and make them easier to parse. BEEF should be dropped from the figures (and, arguably
1. This paper aims to tackle the class-incremental learning problem, which is important to the machine learning field. 2. The topic of hyper-parameter robustness is interesting and has not been investigated in the CIL field. 3. The authors have done extensive experiments to investigate the performance of different methods.
1. Although the authors have done extensive experiments in their new CIL setting, my major concern lies in the rationality of it. In typical machine learning scenarios, the training and testing data are i.i.d. sampled from the same training set. In other words, we train a model, evaluate it on the validation set, and utilize the best model to test on the test set (which has the same data distribution as the validation set). However, the authors advocate using the different data distributions for
1. I appreciate the claim that the commonly used protocol of selecting hyperparameters for continual learning methods may not be optimal in applications, given that the old training samples are largely inaccessible. 2. The authors perform extensive experiments with a variety of continual learning methods under the proposed evaluation protocol.
This paper is essentially based on intuitive ideas, and the empirical results are not very clear. It fails to cover many critical considerations in real-world applications. 1. The authors highlighted many times that the two phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. However, this consideration cannot fully reflect the possible differences across continual learning tasks, such as imbalanced classes per task, imbalanced train
Taxonomy
Topics: Machine Learning and Algorithms · Gaussian Processes and Bayesian Inference · Anomaly Detection Techniques and Applications
