Multiple Modes for Continual Learning

Siddhartha Datta; Nigel Shadbolt

arXiv:2209.14996·cs.LG·September 30, 2022

Open Access · 3 Reviews

TL;DR

This paper introduces Mode-Optimized Task Allocation (MOTA), a novel continual learning strategy that trains multiple parameter modes in parallel and optimizes task assignment, improving adaptability across various distribution shifts.

Contribution

The paper proposes MOTA, a new method that trains multiple parameter modes simultaneously and optimizes task allocation, enhancing continual learning performance.

Findings

01. MOTA outperforms baseline strategies in continual learning tasks.

02. MOTA effectively handles sub-population, domain, and task shifts.

03. Empirical results show improved retention and adaptation.

Abstract

Adapting model parameters to incoming streams of data is crucial to deep learning scalability. Interestingly, prior continual learning strategies in online settings inadvertently anchor their updated parameters to a local parameter subspace to remember old tasks, or else they drift away from the subspace and forget. From this observation, we formulate a trade-off between constructing multiple parameter modes and allocating tasks per mode. Mode-Optimized Task Allocation (MOTA), our contributed adaptation strategy, trains multiple modes in parallel, then optimizes task allocation per mode. We empirically demonstrate improvements over baseline continual learning strategies and across varying distribution shifts, namely sub-population, domain, and task shift.
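The trade-off between adding modes and allocating tasks to them can be illustrated with a toy sketch. This is not the paper's Algorithm 1: the greedy allocation rule, the forgetting cost, the 1-D linear-regression tasks, and every name (`mse`, `gd_fit`, `make_task`, `allocate_and_train`) are assumptions chosen for illustration. Each "mode" is just a pair of parameters `(w, b)`. Every mode is updated on the incoming task, and the task is then assigned to the mode whose update costs the least in terms of new-task loss plus forgetting on the tasks that mode already holds.

```python
def mse(params, data):
    """Mean squared error of the linear model y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def gd_fit(params, data, lr=0.1, steps=200):
    """Full-batch gradient descent on the MSE, starting from params."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)


def make_task(slope, intercept, n=20):
    """Hypothetical toy task: noiseless points on a line, x evenly spaced in [-1, 1]."""
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]
    return [(x, slope * x + intercept) for x in xs]


def allocate_and_train(tasks, num_modes):
    """Greedy stand-in for task allocation (an assumption, not the paper's optimizer)."""
    modes = [(0.0, 0.0) for _ in range(num_modes)]
    owned = [[] for _ in range(num_modes)]  # tasks already allocated to each mode
    allocation = []
    for data in tasks:
        # Update a candidate copy of every mode on the incoming task.
        candidates = [gd_fit(m, data) for m in modes]

        def cost(k):
            # New-task loss plus the loss increase on tasks mode k already holds.
            forgetting = sum(
                mse(candidates[k], old) - mse(modes[k], old) for old in owned[k]
            )
            return mse(candidates[k], data) + forgetting

        best = min(range(num_modes), key=cost)
        modes[best] = candidates[best]  # only the chosen mode keeps its update
        owned[best].append(data)
        allocation.append(best)
    return modes, allocation


# Two conflicting tasks: one mode must overwrite itself, two modes need not.
tasks = [make_task(2.0, 1.0), make_task(-2.0, -1.0)]
for num_modes in (1, 2):
    modes, allocation = allocate_and_train(tasks, num_modes)
    losses = [mse(modes[k], t) for k, t in zip(allocation, tasks)]
    print(num_modes, allocation, losses)
```

With a single mode, fitting the second task overwrites the parameters that solved the first, so the loss on the first task rises. With two modes, each task can be held by its own mode. The price in this sketch is that every mode is updated on every task, so the cost grows with the number of modes.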

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1. The idea of applying a mixture of sub-modes, instead of a single model with a large number of parameters, is intuitively reasonable in my view, and not many works have focused on a parameter-based view.
2. The authors specifically address reducing parameter-space drift between different tasks, with corresponding theoretical analysis. The experimental results also show the effectiveness of this approach.

Weaknesses

1. In the experiments, the authors mainly compare with EWC. I think recent work built on similar ideas (not only parameter drift), e.g., network ensembles, should also be discussed and compared, for example Continual Learning Beyond a Single Model, Dynamic Network Expansion, and so on [1,2].
2. Some descriptions are unclear, e.g., a task has a high level of certainty/uncertainty: how could we measure the degree of such tasks? In Sec. 3.1, for updating each mode with respect to the

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

I find the idea of utilizing multiple 'modes,' which are encouraged to develop different sets of parameters, and then attempting to update them appropriately and selectively, to be truly intriguing. Additionally, I believe that the motivating observation about the way most regularization methods are used to enforce model stability is quite sound.

Weaknesses

I believe that the primary issue with this work is its lack of clarity and organization. It is quite challenging to read, and I attribute this difficulty not to any inherent technical complexity in the proposed method but rather to its poor presentation. I find this to be particularly unfortunate and frustrating because, in my opinion, the work has the potential to make a valuable contribution. It's worth noting that the manuscript falls almost a page short of the total length allowed for this v

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

The proposed method is intuitive and shows some advantages in performance compared to the existing baselines.

Weaknesses

1. The initial task seems to be important, as the modes are computed initially using the first task. Experiments on this should be included.
2. The authors should report the training time of the proposed method as well as that of all the baselines. The proposed method seems computationally expensive, as it computes the gradient of the parameters in each mode in a task (line 7 Alg.1), and it computes another gradient for the combined parameter (line 12 Alg.1).
3. The task split is not cl

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning