Continual Learning by Asymmetric Loss Approximation with Single-Side Overestimation

Dongmin Park; Seokil Hong; Bohyung Han; Kyoung Mu Lee

arXiv:1908.02984·cs.LG·October 23, 2019

TL;DR

This paper introduces a novel continual learning method that uses asymmetric loss approximation with overestimation to mitigate catastrophic forgetting, achieving near-optimal accuracy without additional network components.

Contribution

It proposes a new asymmetric loss approximation technique for continual learning that overestimates unobserved task sides, improving scalability and accuracy.

Findings

01. Achieves state-of-the-art accuracy on benchmark datasets.

02. Effectively mitigates catastrophic forgetting.

03. Operates without additional network components.

Tables

Table 1: Sensitivity to $a$ on the permuted MNIST

$a$	0.8	1.0	2.0	3.0	4.0	5.0
30 tasks	0.91	0.92	0.94	0.94	0.94	0.94
100 tasks	0.59	0.62	0.79	0.79	0.78	0.75

Table 2: Algorithm comparison in terms of accuracy ($A$), forgetting ($F$) and intransigence ($I$) measures [6]

Method	Permuted MNIST $A_{30}$	Permuted MNIST $F_{30}$	Permuted MNIST $I_{30}$	Split CIFAR-10/100 $A_{30}$	Split CIFAR-10/100 $F_{30}$	Split CIFAR-10/100 $I_{30}$
EWC	0.738	0.243	0.010	-	-	-
SI	0.752	0.232	0.008	0.632	0.110	0.137
Ours	0.944	0.027	0.015	0.697	0.026	0.123

Table 3 (Table C in the paper): Network architecture for the permuted MNIST experiment

Layer type	Layer size (or value)	Input size
Dense + ReLU	2000	1x1x784
Dense + ReLU	2000	1x1x2000
Dense	10	1x1x2000
Softmax	-	1x1x10

Table 4 (Table D in the paper): Network architecture for the split CIFAR-10/CIFAR-100 experiment

Layer type	Layer size (or value)	Input size
Conv + ReLU	3x3x4	32x32x3
Conv + ReLU	3x3x4	32x32x4
Max Pooling	2x2	32x32x4
Dropout	0.25	16x16x4
Conv + ReLU	3x3x8	16x16x4
Conv + ReLU	3x3x8	16x16x8
Max Pooling	2x2	16x16x8
Dropout	0.25	8x8x8
Dense + ReLU	64	1x1x512
Dropout	0.5	1x1x64
Dense	90	1x1x64
Softmax (Per-task)	-	1x1x90

Table 5 (Table E in the paper): Network architecture for the split Tiny ImageNet experiment

Layer type	Layer size (or value)	Input size
Conv + ReLU	3x3x32	224x224x3
Max Pooling	2x2	224x224x32
Dropout	0.25	112x112x32
Conv + ReLU	3x3x32	112x112x32
Max Pooling	2x2	112x112x32
Dropout	0.25	56x56x32
Conv + ReLU	3x3x64	56x56x32
Max Pooling	2x2	56x56x64
Dropout	0.25	28x28x64
Conv + ReLU	3x3x64	28x28x64
Max Pooling	2x2	28x28x64
Dropout	0.25	14x14x64
Conv + ReLU	3x3x64	14x14x64
Max Pooling	2x2	14x14x64
Dropout	0.25	7x7x64
Dense + ReLU	2048	1x1x3136
Dropout	0.5	1x1x2048
Dense	180	1x1x2048
Softmax (Per-task)	-	1x1x180

Equations

$$\tilde{\mathcal{L}}^{n} = \mathcal{L}^{n} + c\,\mathcal{L}_{s}^{n-1} = \mathcal{L}^{n} + \underbrace{c \sum_{k} \hat{\Omega}_{k}^{n-1} \left(\theta_{k} - \hat{\theta}_{k}^{n-1}\right)^{2}}_{\text{surrogate loss}}, \tag{1}$$

$$\mathcal{L}_{s}^{n}(\theta_{k}) = \hat{\Omega}_{k}^{n} \left(\theta_{k} - \hat{\theta}_{k}^{n}\right)^{2}, \tag{2}$$

$$\hat{\Omega}_{k}^{n} \approx \frac{\omega_{k}^{n}}{\left(\hat{\theta}_{k}^{n} - \hat{\theta}_{k}^{n-1}\right)^{2}} + \hat{\Omega}_{k}^{n-1}, \tag{3}$$

$$\alpha(\theta_{k}) \equiv \left(\theta_{k} - \hat{\theta}_{k}^{n}\right)\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right). \tag{4}$$

$$\mathcal{L}_{s}^{n}(\theta_{k}, a) = \begin{cases} \hat{\Omega}_{k}^{n} \left(\theta_{k} - \hat{\theta}_{k}\right)^{2} & \text{if } \alpha(\theta_{k}) > 0 \\ \left(a\,\hat{\Omega}_{k}^{n} + \epsilon\right) \left(\theta_{k} - \hat{\theta}_{k}\right)^{2} & \text{if } \alpha(\theta_{k}) \leq 0, \end{cases} \tag{5}$$

$$\tilde{\mathcal{L}}^{n} = \mathcal{L}^{n} + c \sum_{k} \mathcal{L}_{s}^{n-1}(\theta_{k}). \tag{6}$$

$$\hat{\Omega}_{k}^{n} = \frac{\mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n})}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} = \frac{\omega_{k}^{n} + \omega_{k}^{1:(n-1)}}{\left(\hat{\theta}_{k}^{n} - \hat{\theta}_{k}^{n-1}\right)^{2}}. \tag{7}$$

$$\mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n}) = \hat{\Omega}_{k}^{n} \left(\hat{\theta}_{k}^{n} - \hat{\theta}_{k}^{n}\right)^{2} = 0, \tag{8}$$

$$\mathcal{L}_{s}^{n}(\theta_{k}) = \mathcal{L}^{n}(\theta_{k}) + c\,\mathcal{L}_{s}^{n-1}(\theta_{k}), \tag{9}$$

$$\omega_{k}^{n} = \mathcal{L}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}^{n}(\hat{\theta}_{k}^{n}), \tag{10}$$

$$\omega_{k}^{1:(n-1)} = -c\,\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n}). \tag{11}$$

$$\omega_{k}^{1:(n-1)} = -c^{\prime}\,\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n}, a^{\prime}), \tag{12}$$

$$\mathcal{L}^{n}(\theta) = \sum_{k} \mathcal{L}^{n}(\theta_{k}).$$

$$\mathcal{L}_{s}^{n}(\theta_{k}) \equiv \hat{\Omega}_{k}^{n} \left(\theta_{k} - \hat{\theta}_{k}^{n}\right)^{2} = \mathcal{L}^{n}(\theta_{k}) + c\,\mathcal{L}_{s}^{n-1}(\theta_{k}),$$

$$\begin{aligned}
\hat{\Omega}_{k}^{n} &= \frac{\mathcal{L}^{n}(\theta_{k}) + c\,\hat{\Omega}_{k}^{n-1} \left(\theta_{k} - \hat{\theta}_{k}^{n-1}\right)^{2}}{\left(\theta_{k} - \hat{\theta}_{k}^{n}\right)^{2}} \\
&= \frac{\mathcal{L}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}^{n}(\hat{\theta}_{k}^{n})}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} + \frac{\mathcal{L}^{n}(\hat{\theta}_{k}^{n})}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} \\
&= \frac{\mathcal{L}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}^{n}(\hat{\theta}_{k}^{n})}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} + \frac{\mathcal{L}^{n}(\hat{\theta}_{k}^{n}) \cdot \hat{\Omega}_{k}^{n-1}}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2} \cdot \hat{\Omega}_{k}^{n-1}} \\
&= \frac{\mathcal{L}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}^{n}(\hat{\theta}_{k}^{n})}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} + \frac{\mathcal{L}^{n}(\hat{\theta}_{k}^{n}) \cdot \hat{\Omega}_{k}^{n-1}}{\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n})} \\
&= \frac{\omega_{k}^{n}}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} + \frac{\mathcal{L}^{n}(\hat{\theta}_{k}^{n}) \cdot \hat{\Omega}_{k}^{n-1}}{\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n})} \\
&\approx \frac{\omega_{k}^{n}}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} + \hat{\Omega}_{k}^{n-1},
\end{aligned}$$

$$\begin{aligned}
\hat{\Omega}_{k}^{n} &= \frac{\mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n-1})}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} \\
&= \frac{\mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n-1}) - \hat{\Omega}_{k}^{n} \left(\hat{\theta}_{k}^{n} - \hat{\theta}_{k}^{n}\right)^{2}}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} \\
&= \frac{\mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n})}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} \\
&= \frac{\left(\mathcal{L}^{n}(\hat{\theta}_{k}^{n-1}) + c\,\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n-1})\right) - \left(\mathcal{L}^{n}(\hat{\theta}_{k}^{n}) + c\,\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n})\right)}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} \\
&= \frac{\left(\mathcal{L}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}^{n}(\hat{\theta}_{k}^{n})\right) + c \left(\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n})\right)}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} \\
&= \frac{\omega_{k}^{n} + \omega_{k}^{1:(n-1)}}{\left(\hat{\theta}_{k}^{n} - \hat{\theta}_{k}^{n-1}\right)^{2}}.
\end{aligned}$$

Full text

Continual Learning by Asymmetric Loss Approximation with Single-Side Overestimation

Dongmin Park (1,2), Seokil Hong (1), Bohyung Han (1), Kyoung Mu Lee (1)

(1) ECE & ASRI, Seoul National University, Korea

(2) Samsung Electronics, Korea

[email protected], {hongceo96, bhhan, kyoungmu}@snu.ac.kr

Abstract

Catastrophic forgetting is a critical challenge in training deep neural networks. Although continual learning has been investigated as a countermeasure to the problem, it often suffers from the requirements of additional network components and the limited scalability to a large number of tasks. We propose a novel approach to continual learning by approximating a true loss function using an asymmetric quadratic function with one of its sides overestimated. Our algorithm is motivated by the empirical observation that the network parameter updates affect the target loss functions asymmetrically. In the proposed continual learning framework, we estimate an asymmetric loss function for the tasks considered in the past through a proper overestimation of its unobserved sides in training new tasks, while deriving the accurate model parameter for the observable sides. In contrast to existing approaches, our method is free from the side effects and achieves the state-of-the-art accuracy that is even close to the upper-bound performance on several challenging benchmark datasets.

1 Introduction

Machine learning models that serve multiple tasks are commonly trained incrementally, with new tasks arriving one by one rather than in a batch. Continual learning aims to learn a model for a large number of tasks sequentially without forgetting the knowledge obtained from the preceding tasks, where the data of the old tasks are no longer available while training on new ones.

Catastrophic forgetting is a critical issue in realizing continual learning using deep neural networks. With a naïve stochastic gradient descent (SGD) method, deep neural networks easily forget the knowledge obtained from the earlier tasks while adapting to the new information quickly from the incoming tasks [26, 8]. This is mainly because, without any countermeasure, the optimization based on the losses of new tasks may not be desirable to retain the knowledge about the tasks learned in the past. The catastrophic forgetting problem limits the capability and potential of deep neural networks to be applied to dynamic real-world problems that require continuous adaptation to new environments.

Continual learning, also called life-long learning, is a well-known framework for handling the catastrophic forgetting problem. Its methods fall into three categories: architectural, functional, and structural regularization approaches. The architectural and functional approaches typically need additional network components and/or batch processing. The structural regularization methods work well for a limited number of tasks, but often scale poorly to many tasks.

This paper presents a novel continual learning framework based on asymmetric loss approximation with single-side overestimation (ALASSO), which effectively adapts to a large number of tasks. ALASSO approximates the true loss functions corresponding to the previously considered tasks asymmetrically by overestimating their unobserved sides in the parameter space while deriving the accurate quadratic approximation on the observed sides. Figure 1 illustrates the main concept of our approach; it computes the optimal parameter through a quadratic approximation in the observed side (left) while using a steep surrogate quadratic function in the unobservable side (right). The proposed algorithm also decouples the hyperparameters for the current surrogate loss approximation and the surrogate loss change of the previous tasks. This approach is motivated by our observation that updating the model parameters of deep neural networks affects target losses asymmetrically and that using the overestimated loss functions is relatively safe for the optimization without the training data of the previous tasks.

In contrast to the existing approaches, the proposed technique is free from the additional memory requirement to store the information about the previous tasks and the overhead of batch processing. ALASSO achieves the state-of-the-art performance and is even close to the accuracy upper-bounds in several challenging benchmark datasets, including the permuted MNIST, the split CIFAR-10/CIFAR-100 and the split Tiny ImageNet. In particular, we demonstrate promising results in terms of scalability and robustness to a large number of tasks.

The contribution of our work is summarized as follows:

• We propose a novel continual learning framework that overestimates the unobserved side of a loss function in the current task and approximates the loss using an asymmetric quadratic function. This strategy facilitates a reliable loss estimation even without the training data of the previous tasks.

• Our algorithm provides an accurate solution of the loss approximation for the previous tasks, which allows us to derive the best approximation of a quadratic surrogate loss function on the observed side.

• The proposed technique achieves outstanding performance on several challenging benchmark datasets by large margins.

The rest of the paper is organized as follows. We first discuss related work in Section 2, review synaptic intelligence in Section 3, and present the technical details of our algorithm, ALASSO, in Section 4. Section 5 demonstrates the experimental results with analysis and Section 6 concludes this paper.

2 Related Work

Continual learning algorithms are categorized into three groups [37]: architectural, functional, and structural regularization approaches. This section briefly discusses the existing methods in each category and their characteristics.

2.1 Architectural Approaches

The approaches in this category realize continual learning by freezing model parameters learned from the previous tasks and/or providing limited architectural variations for learning the new tasks [30, 22, 3, 20]. This framework often requires additional network components for the new tasks and the size of the network gradually increases in principle. This drawback limits the applicability to large-scale problems and is not appropriate for the configurations with limited resources such as embedded systems. Existing methods [24, 34, 15, 35, 13] often keep track of the data in the previous tasks by allocating additional memory space called episodic memory or by generating training examples using generative adversarial networks [11], but they suffer from substantial memory demands to store the information of the previous tasks. Some approaches in this category aim to improve performance using different nonlinearities such as ReLU, MaxOut, and local winner-take-all [36, 10].

An interesting work among the architectural approaches, instead of adding new network components, employs a network compression technique to identify unused or rarely used parts in the target network and allows the free space to store the information of new tasks [25, 12]. It alleviates the drawbacks of the standard approaches discussed earlier, but requires the substantial overhead of batch processing for network compression.

2.2 Functional Approaches

The functional approaches [23, 14, 32, 28] often incorporate knowledge distillation to establish continual learning. The previously learned networks are fixed and used for feature computation in training new tasks. The new networks are encouraged to learn the representations that are coherent to the features computed in the previous networks. However, the feature coherency of new examples with respect to the old ones does not always ensure the output similarity of the examples in the past and current tasks. Existing functional approaches are required to store and evaluate the previous networks for training the model for the new task, which incurs additional computational overhead.

2.3 Structural Regularization Approaches

The structural regularization approaches typically augment a penalty term to the original loss functions and discourage the updates of parameters critical to the previously learned tasks. Elastic weight consolidation (EWC) [17] and synaptic intelligence (SI) [37] employ a surrogate quadratic loss as an approximation of the real loss functions of the previous tasks. Although this is simple and effective for a small number of tasks, its performance drops drastically when there are a larger number of tasks. Memory aware synapses (MAS) [2] estimates the importance of the weights in an unsupervised manner. Incremental moment matching (IMM) [21] additionally performs a separate model-merging step after learning a new task. Variational continual learning (VCL) [27] combines the online variational inference [9, 31, 5] with Monte Carlo sampling [4] using neural networks, but it requires a relatively large amount of computational cost to infer an approximate posterior distribution. On the other hand, [33] proposes a task-specific hard attention mechanism, but, as in [30, 7, 23], the requirement of multi-head outputs—a separate output per task—limits the number of target domains considered concurrently.

Our framework falls in the structural regularization category. We design a new loss function appropriate for continual learning while it does not incur extra computational overhead such as additional network components and the need for occasional batch processing.

3 Synaptic Intelligence (SI)

The proposed method, referred to as ALASSO, is closely related to synaptic intelligence (SI) [37]. We first discuss the main idea of SI and then point out its critical problems before presenting our novel idea.

3.1 Quadratic Approximation of Loss

SI is categorized as a structural regularization method, which employs a static network architecture and does not use additional memory throughout the continual learning process. To prevent catastrophic forgetting and maintain the performance on the previous tasks while adapting to a new task $n$, this technique introduces a surrogate loss function $\mathcal{L}_{s}^{n-1}$, which approximates the loss of the previous tasks and acts as a regularizer. Assuming that the surrogate loss is a quadratic function, the total loss function to learn the $n^{\text{th}}$ task, $\tilde{\mathcal{L}}^{n}$, in terms of the model parameter $\theta_{k}$ is approximated by

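$$\tilde{\mathcal{L}}^{n} = \mathcal{L}^{n} + c\,\mathcal{L}_{s}^{n-1} = \mathcal{L}^{n} + \underbrace{c \sum_{k} \hat{\Omega}_{k}^{n-1} \left(\theta_{k} - \hat{\theta}_{k}^{n-1}\right)^{2}}_{\text{surrogate loss}}, \tag{1}$$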

where $\mathcal{L}^{n}$ denotes a loss for the current task $n$ , $\hat{\theta}_{k}^{n-1}$ is the weight in the $k^{\text{th}}$ dimension of the estimated parameter for the previous tasks until $n-1$ , $\hat{\Omega}^{n-1}_{k}$ is a coefficient for the corresponding model parameter, and $c$ is a hyperparameter for the surrogate loss.

We minimize the total loss, $\tilde{\mathcal{L}}^{n}$ , with respect to $\theta_{k}$ , and obtain the optimized parameters, $\hat{\theta}_{k}^{n}$ . Assuming that the surrogate loss function for each parameter up to the $n^{\text{th}}$ task is defined by a quadratic function as

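$$\mathcal{L}_{s}^{n}(\theta_{k}) = \hat{\Omega}_{k}^{n} \left(\theta_{k} - \hat{\theta}_{k}^{n}\right)^{2}, \tag{2}$$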

$\hat{\Omega}^{n}_{k}$ is derived approximately by

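$$\hat{\Omega}_{k}^{n} \approx \frac{\omega_{k}^{n}}{\left(\hat{\theta}_{k}^{n} - \hat{\theta}_{k}^{n-1}\right)^{2}} + \hat{\Omega}_{k}^{n-1}, \tag{3}$$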

where $\omega^{n}_{k}$ denotes the difference between the losses before and after training a new task $n$ , i.e., $\omega_{k}^{n}=\mathcal{L}^{n}(\hat{\theta}_{k}^{n-1})-\mathcal{L}^{n}(\hat{\theta}_{k}^{n})$ . Refer to [37] for more details about SI.
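As a rough illustration of Eq. (1) and Eq. (3), the SI bookkeeping for a flat list of parameters might be sketched as follows. The function names, the list-based parameter layout, and the damping constant `xi` are assumptions made for this example; they are not taken from the paper or from the SI code.

```python
def si_total_loss(task_loss, theta, theta_prev, omega_hat_prev, c=1.0):
    """Eq. (1): current-task loss plus c times the quadratic surrogate loss."""
    surrogate = sum(
        o * (t - tp) ** 2 for o, t, tp in zip(omega_hat_prev, theta, theta_prev)
    )
    return task_loss + c * surrogate


def si_update_omega(omega_hat_prev, omega, theta_new, theta_prev, xi=1e-3):
    """Eq. (3): accumulate the per-parameter coefficient after task n.

    xi is a hypothetical damping constant (not part of Eq. (3)) that avoids
    division by zero when a parameter did not move during training.
    """
    return [
        o_prev + w / ((tn - tp) ** 2 + xi)
        for o_prev, w, tn, tp in zip(omega_hat_prev, omega, theta_new, theta_prev)
    ]


# Two parameters; only the first one moved away from its previous optimum.
loss = si_total_loss(0.5, theta=[1.0, 2.0], theta_prev=[0.0, 2.0],
                     omega_hat_prev=[2.0, 3.0], c=1.0)
new_omega = si_update_omega([1.0], omega=[0.5], theta_new=[2.0],
                            theta_prev=[1.0], xi=0.0)
```

In this toy call the penalty comes only from the parameter that moved, and the coefficient grows by the loss drop divided by the squared displacement.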

3.2 Underestimation of Loss

SI employs a symmetric loss function for approximation with respect to the previous tasks while it can observe only a single side of the symmetric function along the trajectory of parameter update from $\hat{\theta}_{k}^{n-1}$ to $\hat{\theta}_{k}^{n}$ during the optimization process for the $n^{\text{th}}$ task. In other words, the surrogate loss function is assumed to be symmetric as shown in Eq. (1) and may not be able to model its unobserved side accurately unless the true loss function is symmetric.

However, it turns out that the true loss functions are typically asymmetric and the symmetric loss functions adopted by SI are prone to underestimate the true losses in practice. Figure 2 illustrates the variation of the losses with respect to the changes of each model parameter; $x$ -axis denotes the offset from the optimal model parameter and $y$ -axis is the difference of the cross-entropy losses, which are obtained from the first task of the permuted MNIST dataset. Note that, to visualize the asymmetry more clearly, the steeper halves of the individual graphs are located on the right hand side. Based on this observation, we consider an asymmetric loss function formulation for a more accurate and stable estimation of the surrogate loss function parametrized by $\hat{\Omega}^{n}_{k}$ .
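The underestimation can be reproduced on a toy one-dimensional loss. The sketch below uses a made-up piecewise quadratic (steeper on the right of its minimum) and fits a symmetric surrogate from a trajectory that only visits the left side; the function and the numbers are illustrative assumptions, not measurements from the paper.

```python
def true_loss(x):
    """Toy asymmetric loss with its minimum at x = 0 (steeper on the right)."""
    return x * x if x <= 0 else 4.0 * x * x


# Training moved the parameter from -1 to 0, so only x <= 0 was observed.
theta_prev, theta_new = -1.0, 0.0
omega_hat = (true_loss(theta_prev) - true_loss(theta_new)) / (theta_prev - theta_new) ** 2


def symmetric_surrogate(x):
    """Symmetric quadratic fitted on the observed side, as in Eq. (2)."""
    return omega_hat * (x - theta_new) ** 2


observed_gap = true_loss(-0.5) - symmetric_surrogate(-0.5)    # observed side
unobserved_gap = true_loss(0.5) - symmetric_surrogate(0.5)    # unobserved side
```

The surrogate matches the toy loss on the observed side, while on the unobserved side the positive gap means the symmetric fit underestimates the true loss.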

4 Proposed Algorithm

We propose a novel structural regularizer, which mitigates the limitations of SI. The main contributions of our algorithm, ALASSO, are the introduction of an asymmetric loss function with single-side overestimation and a more accurate quadratic approximation of the loss function, which together lead to remarkable performance improvement. This section presents the details of ALASSO, especially in comparison to the existing approach, SI.

4.1 Overview

Our algorithm overestimates the unobserved sides of the approximate loss function and allows the models to learn under a harsher condition. It also derives the accurate parameter estimation of the approximate quadratic loss functions on their observed sides. To accelerate the optimization procedure and handle the conflicts between the loss computation of the current task and the loss approximation of the previous tasks, we introduce a hyperparameter decoupling technique although the values of the decoupled hyperparameters should be identical conceptually. The proposed algorithm inherits the merits of the standard structural regularization approaches such as online learning and local updates while providing the capability to maintain the crucial knowledge about the prior tasks by making the models less adaptive without the observation of the loss function. We now discuss the technical contributions and characteristics of our continual learning framework, ALASSO.

4.2 Asymmetric Loss Approximation

One possible option for a better structural regularizer in continual learning is asymmetric loss approximation of the previous tasks. We believe that a symmetric regularizer, as in SI, oversimplifies the true loss functions and has a critical limitation in maintaining the knowledge obtained from the previous tasks. Figure 1 illustrates why asymmetric loss approximation is effective. The approximate quadratic loss functions may be sufficiently accurate on the sides where the true losses are observable along the model parameter updates during training. However, they may incur substantial error on their unobserved sides, so it is dangerous to assume that the true loss functions are symmetric, which is supported by Figure 2.

Based on this motivation, we now propose a simple but effective approach to approximating the true loss functions. We believe that the coefficient, $\hat{\Omega}_{k}^{n}$, is reliable if the learnable parameter ${\theta}_{k}$ and the fixed parameter $\hat{\theta}^{n-1}_{k}$ are on the same side of $\hat{\theta}^{n}_{k}$, because the true loss function is observed on that side during the optimization process. Conversely, if $\theta_{k}$ and $\hat{\theta}^{n-1}_{k}$ are on opposite sides of $\hat{\theta}^{n}_{k}$, $\hat{\Omega}_{k}^{n}$ is unreliable. Figure 3 visualizes the relations between the parameters for both cases.

To model the relationship between the variables, we introduce the following function:

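$$\alpha(\theta_{k}) \equiv \left(\theta_{k} - \hat{\theta}_{k}^{n}\right)\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right). \tag{4}$$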

In our framework, since the true loss function is not fully observable, we introduce an additional parameter, $\hat{\Omega}_{k}^{n}(\geq 0)$ , for the quadratic approximation, which results in the asymmetric loss function as

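$$\mathcal{L}_{s}^{n}(\theta_{k}, a) = \begin{cases} \hat{\Omega}_{k}^{n} \left(\theta_{k} - \hat{\theta}_{k}\right)^{2} & \text{if } \alpha(\theta_{k}) > 0 \\ \left(a\,\hat{\Omega}_{k}^{n} + \epsilon\right) \left(\theta_{k} - \hat{\theta}_{k}\right)^{2} & \text{if } \alpha(\theta_{k}) \leq 0, \end{cases} \tag{5}$$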

where $a(>1)$ is a hyperparameter to control the degree of overestimation and $\epsilon$ is a small positive number that keeps the loss overestimated even when $\hat{\Omega}_{k}^{n}=0$ (footnote 1: since $\hat{\Omega}_{k}^{n}$ is non-negative, $(a\hat{\Omega}_{k}^{n}+\epsilon)>\hat{\Omega}_{k}^{n}$). From now on, we omit the hyperparameter $a$ in $\mathcal{L}_{s}^{n}\left(\theta_{k},a\right)$ for notational simplicity, i.e., $\mathcal{L}_{s}^{n}\left(\theta_{k},a\right)\equiv\mathcal{L}_{s}^{n}\left(\theta_{k}\right)$.

The asymmetric loss function in Eq. (5) is used to define the total loss $\tilde{\mathcal{L}}^{n}$ , which is given by

$$\tilde{\mathcal{L}}^{n} = \mathcal{L}^{n} + c \sum_{k} \mathcal{L}_{s}^{n-1}(\theta_{k}). \tag{6}$$

Note that $\mathcal{L}_{s}^{n-1}$ can also be interpreted as a regularizer for the $n^{\text{th}}$ task. One remaining concern is how to compute $\hat{\Omega}_{k}^{n}$ in Eq. (5), which is discussed next.
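A minimal per-parameter sketch of Eqs. (4) to (6) is given below. The argument names `theta_min` (the minimum of the surrogate) and `theta_before` (the estimate the parameter came from) are hypothetical, as are the default values of `a` and `eps`; the paper's implementation may be organized differently.

```python
def alpha(theta, theta_min, theta_before):
    """Eq. (4): positive when theta lies on the observed side of theta_min."""
    return (theta - theta_min) * (theta_before - theta_min)


def asymmetric_surrogate(theta, theta_min, theta_before, omega_hat, a=2.0, eps=1e-8):
    """Eq. (5) for one scalar parameter; a and eps defaults are illustrative."""
    sq = (theta - theta_min) ** 2
    if alpha(theta, theta_min, theta_before) > 0:
        return omega_hat * sq                # observed side: plain quadratic
    return (a * omega_hat + eps) * sq        # unobserved side: overestimated


def total_loss(task_loss, thetas, theta_mins, theta_befores, omega_hats,
               c=1.0, a=2.0, eps=1e-8):
    """Eq. (6): task loss plus c times the summed asymmetric surrogates."""
    reg = sum(
        asymmetric_surrogate(t, m, b, o, a, eps)
        for t, m, b, o in zip(thetas, theta_mins, theta_befores, omega_hats)
    )
    return task_loss + c * reg


# The previous optimum is 0 and was reached from -1, so x < 0 is the observed side.
same_side = asymmetric_surrogate(-0.5, 0.0, -1.0, 1.0, a=4.0, eps=0.0)
other_side = asymmetric_surrogate(0.5, 0.0, -1.0, 1.0, a=4.0, eps=0.0)
```

With `a=4.0` the penalty for the same displacement is four times larger on the unobserved side, which is the single-side overestimation described above.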

4.3 Accurate Quadratic Approximation

In addition to the overestimation of the true loss in the unobserved sides for a new task $n$ , our algorithm estimates the optimal coefficient $\hat{\Omega}_{k}^{n}$ for the quadratic approximation of the loss function in its observed sides. Note that, as presented in Eq. (5), $\hat{\Omega}_{k}^{n}$ affects the approximation of the loss function in both sides; it is critical to derive $\hat{\Omega}_{k}^{n}$ accurately for performance improvement.

The quadratic surrogate loss function in SI is determined by approximating its parameters $\hat{\Omega}_{k}^{n}$ as in Eq. (3), which is derived from Eq. (2). Instead of using the equation, we present a new derivation that leads to the exact quadratic approximation, which is given by

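$$\hat{\Omega}_{k}^{n} = \frac{\mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n})}{\left(\hat{\theta}_{k}^{n-1} - \hat{\theta}_{k}^{n}\right)^{2}} = \frac{\omega_{k}^{n} + \omega_{k}^{1:(n-1)}}{\left(\hat{\theta}_{k}^{n} - \hat{\theta}_{k}^{n-1}\right)^{2}}. \tag{7}$$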

Note that $\hat{\Omega}_{k}^{n}$ means the change of the surrogate loss $\mathcal{L}^{n}_{s}$ when the model parameter changes from $\hat{\theta}^{n-1}_{k}$ to $\hat{\theta}^{n}_{k}$ during the learning process of the $n^{\text{th}}$ task. The following properties and definitions are required to derive Eq. (7):

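$$\mathcal{L}_{s}^{n}(\hat{\theta}_{k}^{n}) = \hat{\Omega}_{k}^{n} \left(\hat{\theta}_{k}^{n} - \hat{\theta}_{k}^{n}\right)^{2} = 0, \tag{8}$$

$$\mathcal{L}_{s}^{n}(\theta_{k}) = \mathcal{L}^{n}(\theta_{k}) + c\,\mathcal{L}_{s}^{n-1}(\theta_{k}), \tag{9}$$

and

$$\omega_{k}^{n} = \mathcal{L}^{n}(\hat{\theta}_{k}^{n-1}) - \mathcal{L}^{n}(\hat{\theta}_{k}^{n}), \tag{10}$$

$$\omega_{k}^{1:(n-1)} = -c\,\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n}). \tag{11}$$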

The second equality of Eq. (7) is simply obtained from plugging $\hat{\theta}_{k}^{n-1}$ into $\theta_{k}$ in both the numerator and denominator. The numerator in Eq. (7) has two terms corresponding to Eq. (10) and (11), which implies that $\hat{\Omega}_{k}^{n}$ is given by the sum of the two parameter importances—the loss differences with respect to the parameter changes—of the new and previous tasks. Eq. (7) is actually different from Eq. (3) only in the importance term of the previous tasks. In principle, the parameter importance of the previous tasks should be updated depending on the identified local minimum ( $\hat{\theta}_{k}^{n}$ ) of the quadratic function. Note that our formulation presented in Eq. (7) realizes this exactly while Eq. (3) uses the fixed importance, $\hat{\Omega}_{k}^{n-1}$ , resulting in poor approximation.
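For a single parameter, the update of Eq. (7) could be sketched as below. The clamp to zero and the damping constant `xi` are assumptions added for this example (the text only states that $\hat{\Omega}_{k}^{n}$ is non-negative); the function name and its arguments are likewise hypothetical.

```python
def alasso_omega(omega_new, prev_surrogate_at_theta_n, theta_hat_n, theta_hat_prev,
                 c=1.0, xi=1e-3):
    """Eq. (7) for one scalar parameter.

    omega_new                  loss drop of the new task, Eq. (10)
    prev_surrogate_at_theta_n  previous surrogate loss evaluated at theta_hat_n
    xi                         hypothetical damping term against zero displacement
    """
    omega_prev_tasks = -c * prev_surrogate_at_theta_n          # Eq. (11)
    denom = (theta_hat_n - theta_hat_prev) ** 2 + xi
    # Assumption: clamp at zero so that the coefficient stays non-negative.
    return max(0.0, (omega_new + omega_prev_tasks) / denom)


coeff = alasso_omega(0.5, 0.2, theta_hat_n=1.0, theta_hat_prev=0.0, c=1.0, xi=0.0)
clamped = alasso_omega(0.1, 0.5, theta_hat_n=1.0, theta_hat_prev=0.0, c=1.0, xi=0.0)
```

Unlike Eq. (3), the contribution of the previous tasks here depends on where the new minimum landed, because the previous surrogate is evaluated at $\hat{\theta}_{k}^{n}$.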

4.4 Parameter Decoupling

The overestimated loss approximation is effective to reduce errors that happen inevitably in the previous tasks, but the optimization with this technique may suffer from slow convergence due to its inherent limitation. Specifically, some hyperparameters, such as $a$ and $c$ , affect the objective function in one way or another depending on where they occur in our formulation. This is because we approximate the real loss for all the previous tasks using a single asymmetric function and it is almost impossible to consider all the possible combinations of the approximate functions estimated in the past.

In our formulation, there are two different occurrences of hyperparameters; one is related to the gradient computation in SGD and the other is in calculating $\hat{\Omega}^{n}_{k}$ for loss approximation. For example, increasing $c$ in Eq. (6) makes the model consider the previous tasks more while increasing $c$ in Eq. (11) results in more weight on the current task by underestimating $\hat{\Omega}^{n}_{k}$ .

To handle the inconsistent impact of identical hyperparameters on the optimization process, we decouple the parameters into two sets; a set of parameters used for computation of $\omega_{k}^{1:(n-1)}$ and the other set of parameters used for surrogate loss estimation.

Then, when we compute $\omega^{1:(n-1)}_{k}$ in Eq. (11), the hyperparameters $a$ and $c$ are decoupled from the other equations as

$$\omega_{k}^{1:(n-1)} = -c^{\prime}\,\mathcal{L}_{s}^{n-1}(\hat{\theta}_{k}^{n}, a^{\prime}), \tag{12}$$

where $a$ and $c$ in Eq. (11) are replaced by $a^{\prime}$ and $c^{\prime}$ , respectively.
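The decoupling can be pictured as passing a separate pair of values to the importance computation only. In the sketch below, `a_prime` and `c_prime` stand for $a^{\prime}$ and $c^{\prime}$; the helper names, the `eps` default, and the numeric values are assumptions for illustration.

```python
def surrogate_prev(theta, theta_min, theta_before, omega_hat, a, eps=1e-8):
    """Asymmetric surrogate of the previous tasks (form of Eq. (5)), one parameter."""
    sq = (theta - theta_min) ** 2
    same_side = (theta - theta_min) * (theta_before - theta_min) > 0
    return omega_hat * sq if same_side else (a * omega_hat + eps) * sq


def omega_prev_tasks(theta_hat_n, theta_hat_nm1, theta_hat_nm2, omega_hat_nm1,
                     a_prime, c_prime):
    """Decoupled form of Eq. (11): uses a_prime and c_prime instead of a and c."""
    return -c_prime * surrogate_prev(
        theta_hat_n, theta_hat_nm1, theta_hat_nm2, omega_hat_nm1, a_prime
    )


# The new minimum 0.5 lies on the unobserved side of the previous minimum 0.0.
w_small_a = omega_prev_tasks(0.5, 0.0, -1.0, 1.0, a_prime=1.0, c_prime=0.5)
w_large_a = omega_prev_tasks(0.5, 0.0, -1.0, 1.0, a_prime=2.0, c_prime=0.5)
```

The regularizer in Eq. (6) would keep using $a$ and $c$, so the two roles of each hyperparameter can be tuned separately.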

4.5 Discussion

Figure 4 illustrates the promising results of the proposed algorithm in comparison to SI, which is one of the state-of-the-art methods. The figure presents how the accuracy of each task that is learned earlier changes over time as new tasks are added one by one. We notice that the amount of degradation in ALASSO (ours) is much smaller than SI and the accuracy differences of the two algorithms at the same number of tasks are getting larger as time goes by. These results clearly show the potential of our algorithm.

We claim that the proposed algorithm is practically good because it does not involve side effects such as architectural modification, network size increase, additional memory requirements, batch processing, multiple execution of networks, and inference on multi-head networks. We only need to store several variables such as $\Omega^{n}_{k}$ , $\hat{\theta}^{n}_{k}$ , and $\omega^{n}_{k}$ during training, and perform additional operations to compute the surrogate losses.

5 Experiments

This section demonstrates performance of our algorithm compared to existing approaches on the standard datasets.

5.1 Datasets and Algorithms

We employ three standard benchmark datasets to evaluate the proposed continual learning framework: the permuted MNIST, the split CIFAR-10/CIFAR-100 and the split Tiny ImageNet. The permuted MNIST is a synthetic dataset based on MNIST [19], where all pixels of an image in MNIST are permuted differently but coherently in each task. This dataset contains a large number of tasks and is widely used for evaluation of continual learning algorithms [36, 10, 17, 37]. The split CIFAR-10/CIFAR-100 dataset is generated from CIFAR-10 and CIFAR-100 [18] while the split Tiny ImageNet is derived from Tiny ImageNet [1]. These datasets divide their target classes into multiple subsets, which correspond to individual tasks.
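One common reading of "permuted differently but coherently" is that each task draws a single random pixel permutation and applies it to every image of that task. The sketch below follows that reading; the seeding scheme and function names are assumptions, and the integer list is only a stand-in for a flattened MNIST image.

```python
import random


def make_task_permutation(num_pixels, task_seed):
    """Return one fixed pixel permutation for a task (seeded, so reproducible)."""
    rng = random.Random(task_seed)
    perm = list(range(num_pixels))
    rng.shuffle(perm)
    return perm


def permute_image(flat_image, perm):
    """Apply the task's permutation to a flattened image."""
    return [flat_image[src] for src in perm]


perm_task_3 = make_task_permutation(28 * 28, task_seed=3)
image = list(range(28 * 28))   # stand-in for a flattened 28x28 MNIST image
permuted = permute_image(image, perm_task_3)
```

Because the permutation is a function of the task seed only, all images within a task share the same pixel shuffle, while different tasks see different shuffles.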

We compare our algorithm with the existing state-of-the-art methods including naïve SGD [29, 16], SGD with dropout (SGD+dropout) [10], VCL [27], EWC [17], SI [37], IMM [21] and MAS [2]. To estimate the upper-bound accuracy of our approach, we present the results from the models learned for individual tasks and all the tasks in a batch, which are denoted by single_task and multi_task, respectively.

5.2 Training Details

Our models are trained based on the description in SI [37]; the models are optimized by Adam and the learning rate is 0.001 for all the tested datasets. The batch size is 256 for the permuted MNIST and the split CIFAR-10/CIFAR-100, and 128 for the split Tiny ImageNet while the numbers of epochs are 20 for the permuted MNIST and 60 for the other two datasets. We plan to release our source code and raw results for better reproducibility.
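The settings above can be collected in a small configuration table. The dictionary layout and key names below are hypothetical; only the values come from the text.

```python
# Training settings as stated in the text; the key names are illustrative.
TRAIN_CONFIGS = {
    "permuted_mnist": {
        "optimizer": "adam", "learning_rate": 0.001, "batch_size": 256, "epochs": 20,
    },
    "split_cifar10_100": {
        "optimizer": "adam", "learning_rate": 0.001, "batch_size": 256, "epochs": 60,
    },
    "split_tiny_imagenet": {
        "optimizer": "adam", "learning_rate": 0.001, "batch_size": 128, "epochs": 60,
    },
}


def config_for(dataset_name):
    """Look up the training settings for one benchmark."""
    return TRAIN_CONFIGS[dataset_name]
```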

5.3 Results on the Permuted MNIST

We present the results from all compared algorithms on 30 and 100 tasks of the permuted MNIST dataset (footnote 2: for better visualization, we omit the results from SGD, SGD+dropout, IMM and MAS because their accuracies are significantly lower than the presented ones). The network architectures used in this experiment are simple multi-layer perceptrons composed of two hidden layers with additional ReLU and softmax output layers. Dropout is not used except for SGD+dropout.

As Figure 5 shows, ALASSO outperforms all compared methods by large margins (at least about 15 percentage points) for both 30 and 100 tasks. Note that all other methods undergo large performance drops as the number of tasks increases, while ALASSO is more robust; it shows only a moderate performance loss even after learning 100 tasks.

To verify the effectiveness of the asymmetric loss overestimation on the unobserved side and the accurate quadratic approximation on the observable one, we perform the ablation study to analyze their benefit. Figure 5 also illustrates the clear contribution of the components.

Figure 7 shows that the accuracy of ALASSO is stable compared to the other methods. Parameter decoupling does not degrade the overall performance of ALASSO; rather, it improves accuracy slightly, as illustrated in Figure 7, and facilitates fast convergence of the models.

5.4 Results from the Split CIFAR-10/CIFAR-100

We evaluate our algorithm on a more realistic scenario using the split CIFAR-10/CIFAR-100 dataset. In this experiment, each task is composed of 3 classes and the number of classes in a task gets larger as the index of the task increases. Our model is based on a convolutional neural network with 4 convolutional layers followed by 2 fully connected layers with dropout; ReLU and $2\times 2$ max pooling are also employed to add nonlinearity to the model. Figure 8 shows the results from single_task (to show the upper-bound accuracy), SI, and ALASSO on the split CIFAR-10/CIFAR-100 dataset. The proposed method outperforms SI consistently even on this more realistic dataset and is 5.7 percentage points better than SI on average; the accuracy of SI drops by 13.9 percentage points, and 41% of the accuracy lost by SI is recovered by ALASSO. Note that ALASSO even shows performance comparable to the single_task method in some cases.
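
The 41% figure can be checked with a line of arithmetic. The snippet below assumes that it is simply the ratio of the two reported numbers (the average gain over SI divided by the accuracy drop of SI); the variable names are illustrative.

```python
# Reported values, in percentage points.
gain_over_si = 5.7   # average improvement of ALASSO over SI
si_drop = 13.9       # accuracy lost by SI

# Fraction of the lost accuracy that ALASSO recovers (assumed to be a simple ratio).
recovered_fraction = gain_over_si / si_drop
print(f"recovered fraction: {recovered_fraction:.2f}")  # recovered fraction: 0.41
```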

5.5 Results from the Split Tiny ImageNet

We conduct an experiment with a larger network on the split Tiny ImageNet dataset, which has more classes and consists of tasks with 6 classes each. The model has 5 convolutional layers followed by 2 fully connected layers with dropout; $2\times 2$ max pooling and ReLU activation functions are employed as well to add nonlinearity to the network.

Figure 9 illustrates the results from single_task, SI, and ALASSO. Note that the overall accuracy of ALASSO is competitive with single_task, which conceptually serves as a practical upper bound on performance. In fact, the average validation accuracy of our method is 59.4%, which is even higher than the single_task accuracy of 58.9%. ALASSO also outperforms SI by approximately 10 percentage points on average.

5.6 More Analysis

The primary hyperparameter in our algorithm is $a$, which is introduced to overestimate the unobserved side of the loss function. Since $a$ determines the factor of overestimation, it is reasonable to set its value larger than 1.0. We determine the value of $a$ empirically on the permuted MNIST dataset; we choose similar values for the other datasets, and they are fixed within each dataset. In our experience, the overall performance is not sensitive to the value of $a$ over a wide range, as presented in Table 1. If $a$ is 1, the loss function is approximated by a symmetric function. Note that the performance of our algorithm degrades when $a=0.8~(<1)$. Another hyperparameter, $c$, balances the losses of the current and the past tasks, and it is set to 1.0 throughout the experiments.
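
The role of $a$ can be illustrated with a per-parameter sketch of an asymmetric quadratic surrogate. This is not the paper's implementation: it assumes that the unobserved side is the side of $\hat{\theta}^{n}_{k}$ lying beyond the direction of the last update from $\hat{\theta}^{n-1}_{k}$, and the function and argument names are hypothetical.

```python
def asymmetric_surrogate(theta, theta_hat, theta_prev, omega, a):
    """Quadratic penalty around theta_hat, scaled by `a` on the unobserved side.

    Assumption for this sketch: the side between theta_prev and theta_hat was
    traversed during training (observed); moving further past theta_hat in the
    same direction is treated as unobserved and is overestimated by factor `a`.
    """
    step = theta_hat - theta_prev   # direction of the last parameter update
    offset = theta - theta_hat      # displacement from the current optimum
    unobserved = offset * step > 0  # same sign: beyond the explored range
    scale = a if unobserved else 1.0
    return scale * omega * offset ** 2


# With a = 1 the surrogate is symmetric; with a > 1 one side is steeper.
observed_side = asymmetric_surrogate(0.5, 1.0, 0.0, omega=2.0, a=3.0)    # 0.5
unobserved_side = asymmetric_surrogate(1.5, 1.0, 0.0, omega=2.0, a=3.0)  # 1.5
```

Both query points are at distance 0.5 from the optimum, but the penalty on the assumed unobserved side is three times larger when $a=3$.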

Although our original formulation uses no additional memory, we also compare the performance of ALASSO with VCL [27], which is a method with episodic memory, under the same conditions. Figure 10 illustrates that our algorithm still achieves outstanding performance.

Table 2 presents the accuracy ($A$), forgetting ($F$) and intransigence ($I$) measures of EWC, SI, and ALASSO on the permuted MNIST and the split CIFAR-10/CIFAR-100, where the subscripts of $A$, $F$, and $I$ denote the number of tasks at which the performance is computed. Note that the accuracy measure is the most comprehensive and is conceptually correlated with the other two. ALASSO outperforms EWC and SI, especially in terms of the accuracy and forgetting measures.
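
For readers unfamiliar with these measures, the sketch below computes them from a matrix of per-task accuracies. It follows the commonly used definitions associated with [6] in simplified form and is an assumption about how such numbers are typically obtained, not the evaluation code behind Table 2. Here `acc[i][j]` is the accuracy on task `j` after training through task `i` (0-indexed), and `reference_acc[i]` is the accuracy of a hypothetical reference model on task `i`.

```python
def average_accuracy(acc, k):
    """A_k: mean accuracy over the first k tasks after training on k tasks."""
    return sum(acc[k - 1][j] for j in range(k)) / k


def forgetting(acc, k):
    """F_k: mean drop from the best past accuracy of each earlier task."""
    if k < 2:
        raise ValueError("forgetting needs at least two tasks")
    drops = []
    for j in range(k - 1):
        # Best accuracy on task j between learning it and the step before k.
        best_past = max(acc[i][j] for i in range(j, k - 1))
        drops.append(best_past - acc[k - 1][j])
    return sum(drops) / (k - 1)


def intransigence(acc, reference_acc, k):
    """I_k: gap between a reference model and the continual model on task k."""
    return reference_acc[k - 1] - acc[k - 1][k - 1]


# Toy accuracies for three tasks (illustrative numbers only).
acc = [[0.90], [0.80, 0.90], [0.70, 0.85, 0.90]]
reference_acc = [0.92, 0.93, 0.95]
a3 = average_accuracy(acc, 3)
f3 = forgetting(acc, 3)
i3 = intransigence(acc, reference_acc, 3)
```

Lower forgetting and intransigence are better, while higher accuracy is better.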

6 Conclusion

We presented a novel continual learning framework based on an overestimated asymmetric loss approximation with a better parametrization for the quadratic approximation, which is a carefully designed generalization of SI. Our algorithm alleviates the catastrophic forgetting issue, which is common in deep neural networks, and is particularly helpful for continual learning scenarios with a large number of tasks. The proposed solution is motivated by the observation that network parameter updates do not affect the target loss function symmetrically, and it does not incur substantial side effects. It achieves state-of-the-art performance on several challenging standard benchmark datasets.

Acknowledgments

This work was supported by the National Research Foundation (NRF) grant funded by the Korea Government (MSIT) (NRF2017R1A2B2011862).

G Additional comment of $\mathcal{L}^{n}(\theta_{k})$

This section presents the loss for the current task, $\mathcal{L}^{n}(\theta_{k})$, which is the component of the loss $\mathcal{L}^{n}(\theta)$ that depends on the parameter $\theta_{k}$. The loss $\mathcal{L}^{n}(\theta)$ can be expressed as the sum of $\mathcal{L}^{n}(\theta_{k})$:

$$\mathcal{L}^{n}(\theta)=\sum_{k}\mathcal{L}^{n}(\theta_{k})$$

$\mathcal{L}^{n}(\theta_{k})$ can be interpreted as the parameter-specific contribution to the total loss $\mathcal{L}^{n}(\theta)$.

H Derivation of $\hat{\Omega}_{k}^{n}$ in Eq. (3) and (7) in the main paper

We assume that the surrogate loss function for each parameter up to the $n^{\text{th}}$ task, $\mathcal{L}_{s}^{n}(\theta_{k})$, is defined by a quadratic function, which is further decomposed into two terms as

[TABLE]

where $\mathcal{L}^{n}\left(\theta_{k}\right)$ is the loss for the current task, $\mathcal{L}_{s}^{n-1}(\theta_{k})$ is the surrogate loss function up to the $\left(n-1\right)^{\text{th}}$ task, and $c$ is a hyperparameter that balances the two terms.

After learning of the $n^{\text{th}}$ task is complete, we can further assume that we already have the $n^{\text{th}}$ loss function $\mathcal{L}^{n}\left(\theta_{k}\right)$ and the new model parameter $\hat{\theta}^{n}_{k}$, given $\hat{\Omega}^{n-1}_{k}$ and $\hat{\theta}^{n-1}_{k}$ from the previous iteration. Then, we can derive the value of $\hat{\Omega}^{n}_{k}$ satisfying Eq. (A.14) in SI and ALASSO based on the following procedures.

H.1 Synaptic Intelligence [37]

Eq. (A.15) shows that the $\hat{\Omega}^{n}_{k}$ obtained by the SI [37] method is not an accurate approximation for the quadratic surrogate loss function.

[TABLE]

where $\omega^{n}_{k}$ denotes the difference between the losses for the task $n$ before and after training the task, $\omega^{n}_{k}=\mathcal{L}^{n}\left(\hat{\theta}_{k}^{n-1}\right)-\mathcal{L}^{n}\left(\hat{\theta}_{k}^{n}\right)$.

H.2 ALASSO

Eq. (A.16) shows that the $\hat{\Omega}^{n}_{k}$ obtained by ALASSO provides the perfect quadratic approximation.

[TABLE]

where $\omega^{n}_{k}$ denotes the difference between the losses for the task $n$ before and after training the task, $\omega^{n}_{k}=\mathcal{L}^{n}\left(\hat{\theta}_{k}^{n-1}\right)-\mathcal{L}^{n}\left(\hat{\theta}_{k}^{n}\right)$, and $\omega_{k}^{1:(n-1)}$ denotes the difference between the surrogate losses of the previous tasks before and after training the task $n$, $\omega_{k}^{1:(n-1)}=c\left(\mathcal{L}^{n-1}_{s}\left(\hat{\theta}_{k}^{n-1}\right)-\mathcal{L}_{s}^{n-1}\left(\hat{\theta}_{k}^{n}\right)\right)$.
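
To make the role of the two differences concrete, the following worked step sketches how they determine $\hat{\Omega}^{n}_{k}$. It is a reconstruction from the definitions in this section rather than a copy of Eq. (A.16): it assumes that the surrogate is a quadratic centered at $\hat{\theta}^{n}_{k}$ and that it is required to reproduce the change of the decomposed loss between $\hat{\theta}^{n-1}_{k}$ and $\hat{\theta}^{n}_{k}$.

```latex
% Assumed surrogate form: a quadratic centered at the new parameter.
\mathcal{L}_{s}^{n}(\theta_{k}) = \hat{\Omega}^{n}_{k}\left(\theta_{k}-\hat{\theta}^{n}_{k}\right)^{2}

% Matching the change of  L^n + c L_s^{n-1}  between the old and new parameters:
\hat{\Omega}^{n}_{k}\left(\hat{\theta}^{n-1}_{k}-\hat{\theta}^{n}_{k}\right)^{2}
  = \underbrace{\mathcal{L}^{n}\left(\hat{\theta}^{n-1}_{k}\right)-\mathcal{L}^{n}\left(\hat{\theta}^{n}_{k}\right)}_{\omega^{n}_{k}}
  + \underbrace{c\left(\mathcal{L}^{n-1}_{s}\left(\hat{\theta}^{n-1}_{k}\right)-\mathcal{L}^{n-1}_{s}\left(\hat{\theta}^{n}_{k}\right)\right)}_{\omega^{1:(n-1)}_{k}}

% Solving for the curvature term:
\hat{\Omega}^{n}_{k}
  = \frac{\omega^{n}_{k}+\omega^{1:(n-1)}_{k}}{\left(\hat{\theta}^{n}_{k}-\hat{\theta}^{n-1}_{k}\right)^{2}}
```

Under these assumptions the quadratic agrees with the decomposed loss at both $\hat{\theta}^{n-1}_{k}$ and $\hat{\theta}^{n}_{k}$ up to a constant offset. A small damping constant is usually added to the denominator in practice, as in SI, to avoid division by zero.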

I Additional comparison with other methods

In this section, we show comparative experimental results with more algorithms on the permuted MNIST with 30 tasks. The compared state-of-the-art algorithms include SI [37], EWC [17], VCL [27], MAS [2], IMM [21], naïve stochastic gradient descent (SGD) [29, 16], and SGD with dropout (SGD+dropout) [10]. Figure K and Figure L show the accuracy and standard deviation of each algorithm over the number of tasks. Multi task, which serves as another upper bound, denotes simple multi-task learning with mini-batches that mix data from all the previous tasks and the current task. According to the experiment in Figure K, our algorithm achieves the best performance among all compared methods by large margins (at least about 15 percentage points). Besides, Figure L shows that our algorithm is stable in the sense that the performance variation across tasks is small compared to all other methods, even though the learning process is incremental.

J Configuration of the experimental architectures

The detailed network architecture configurations for our experiments on the permuted MNIST (Table C), the split CIFAR-10/CIFAR-100 (Table D), and the split Tiny ImageNet (Table E) are given below.

Bibliography

[1] Tiny ImageNet Visual Recognition Challenge. Available at tiny-imagenet.herokuapp.com.
[2] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision, 2018.
[3] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[4] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, 2015.
[5] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming variational bayes. In Advances in Neural Information Processing Systems, 2013.
[6] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision, 2018.
[7] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[8] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 1999.
