Weight Friction: A Simple Method to Overcome Catastrophic Forgetting and Enable Continual Learning

Gabrielle K. Liu

arXiv:1908.01052·cs.LG·August 20, 2019


TL;DR

This paper introduces weight friction, a simple and efficient method inspired by neurology and physics, to prevent catastrophic forgetting in neural networks and facilitate continual learning across multiple tasks.

Contribution

The paper proposes weight friction, a novel modification to gradient descent, enabling neural networks to learn sequential tasks without forgetting, with improved efficiency.

Findings

01 Performs comparably to existing methods in preventing forgetting.

02 Operates efficiently with lower computational and memory costs.

03 Converges at a rate similar to stochastic gradient descent.

Abstract

In recent years, deep neural networks have found success in replicating human-level cognitive skills, yet they suffer from several major obstacles. One significant limitation is the inability to learn new tasks without forgetting previously learned tasks, a shortcoming known as catastrophic forgetting. In this research, we propose a simple method to overcome catastrophic forgetting and enable continual learning in neural networks. We draw inspiration from principles in neurology and physics to develop the concept of weight friction. Weight friction operates by a modification to the update rule in the gradient descent optimization method. It converges at a rate comparable to that of the stochastic gradient descent algorithm and can operate over multiple task domains. It performs comparably to current methods while offering improvements in computation and memory efficiency.

Tables

Table 1: The behavior of weight friction.

$\|w\|$	$\|g(w)\|$	Weight friction
Large	Close to 0	Large
Small	Close to 1	Small

Equations

$$F_{\text{friction}} = \mu m g,$$

$$w = w - \alpha\, g(w)\, \frac{\partial \mathcal{L}}{\partial w},$$

$$g(w) = \frac{4 e^{\mu w}}{(1 + e^{\mu w})^{2}},$$

$$\underset{w}{\arg\min}\, \mathcal{L}(w) = w^{*},$$

$$w_{t+1} = w_{t} - \alpha\, g(w_{t})\, \nabla_{w}\mathcal{L}(w_{t}), \tag{1}$$

$$g(w_{t}) = \frac{4 e^{\mu w_{t}}}{(1 + e^{\mu w_{t}})^{2}}$$

$$R_{\mathcal{L}}(T) = \sum_{t=1}^{T} \left(\mathcal{L}(w_{t}) - \mathcal{L}(w^{*})\right).$$

$$\mathcal{L}(y) \geq \mathcal{L}(x) + \langle \nabla\mathcal{L}(x), y - x \rangle \quad \forall x, y.$$

$$\begin{aligned}
\|\nabla\mathcal{L}(x) - \nabla\mathcal{L}(y)\| &\leq L \|x - y\| \\
\implies \mathcal{L}(y) &\leq \mathcal{L}(x) + \langle \nabla\mathcal{L}(x), y - x \rangle + \frac{L}{2}\|y - x\|^{2}
\end{aligned} \tag{4}$$

$$\|w_{t}\|^{2} - 2\langle w_{t}, w^{*}\rangle + \|w^{*}\|^{2} - 2\alpha g \langle w_{t}, \nabla_{w}\mathcal{L}(w_{t})\rangle + 2\alpha g \langle \nabla_{w}\mathcal{L}(w_{t}), w^{*}\rangle + \alpha^{2} g^{2} \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}.$$

$$\|w_{t}\|^{2} - 2\langle w_{t}, \alpha g \nabla_{w}\mathcal{L}(w_{t})\rangle + \|\alpha g \nabla_{w}\mathcal{L}(w_{t})\|^{2} - 2\langle w_{t}, w^{*}\rangle - 2\langle -\alpha g \nabla_{w}\mathcal{L}(w_{t}), w^{*}\rangle + \|w^{*}\|^{2}.$$

$$\|a - b\|^{2} = \|a\|^{2} - 2\langle a, b\rangle + \|b\|^{2}.$$

$$\|w_{t} - \alpha g \nabla_{w}\mathcal{L}(w_{t})\|^{2} - 2\langle w_{t} - \alpha g \nabla_{w}\mathcal{L}(w_{t}), w^{*}\rangle + \|w^{*}\|^{2}.$$

$$\|(w_{t} - \alpha g \nabla_{w}\mathcal{L}(w_{t})) - w^{*}\|^{2}.$$

$$\|w_{t+1} - w^{*}\|^{2}.$$

$$R_{\mathcal{L}}(T) = \mathcal{O}\left(\|w_{1} - w^{*}\|^{2}\right).$$

$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_{t}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t+1} - w_{t}\rangle + \frac{L}{2}\|w_{t+1} - w_{t}\|^{2}$$

$$\begin{aligned}
\mathcal{L}(w_{t+1}) &\leq \mathcal{L}(w_{t}) + \langle \nabla_{w}\mathcal{L}(w_{t}), -\alpha g(w_{t}) \nabla_{w}\mathcal{L}(w_{t})\rangle + \frac{L}{2}\|-\alpha g(w_{t}) \nabla_{w}\mathcal{L}(w_{t})\|^{2} \\
&= \mathcal{L}(w_{t}) - \alpha g(w_{t}) \|\nabla_{w}\mathcal{L}(w_{t})\|^{2} + \frac{L}{2}(\alpha g(w_{t}))^{2} \|\nabla_{w}\mathcal{L}(w_{t})\|^{2} \\
&= \mathcal{L}(w_{t}) - \alpha g(w_{t}) \left(1 - \frac{L}{2}\alpha g(w_{t})\right) \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}
\end{aligned}$$

$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_{t}) - \alpha g(w_{t}) \left(1 - \frac{1}{2} g(w_{t})\right) \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}. \tag{5}$$

$$g(w_{t}) = \frac{4 e^{\mu w_{t}}}{(1 + e^{\mu w_{t}})^{2}} \leq 1 \quad \forall w_{t} \text{ and } \forall \mu > 0$$

$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_{t}) - \alpha g(w_{t}) \frac{1}{2} \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}. \tag{6}$$

$$\mathcal{L}(w^{*}) \geq \mathcal{L}(w_{t}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w^{*} - w_{t}\rangle.$$

$$\mathcal{L}(w_{t}) \leq \mathcal{L}(w^{*}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle,$$

$$\begin{aligned}
\mathcal{L}(w_{t+1}) &\leq \mathcal{L}(w^{*}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle - \frac{\alpha g}{2}\|\nabla_{w}\mathcal{L}(w_{t})\|^{2} \\
&= \mathcal{L}(w^{*}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle - \frac{\alpha g}{2}\|\nabla_{w}\mathcal{L}(w_{t})\|^{2} + \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \|w_{t} - w^{*}\|^{2}\right) \\
&= \mathcal{L}(w^{*}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle - \frac{\alpha g}{2}\|\nabla_{w}\mathcal{L}(w_{t})\|^{2} + \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \left(\|w_{t}\|^{2} - 2\langle w_{t}, w^{*}\rangle + \|w^{*}\|^{2}\right)\right).
\end{aligned}$$

$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w^{*}) + \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \left(\|w_{t}\|^{2} - 2\langle w_{t}, w^{*}\rangle + \|w^{*}\|^{2} - 2\alpha g \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle + \alpha^{2} g^{2} \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}\right)\right).$$

$$\begin{aligned}
\implies \mathcal{L}(w_{t+1}) &\leq \mathcal{L}(w^{*}) + \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \|w_{t+1} - w^{*}\|^{2}\right) \\
\mathcal{L}(w_{t+1}) - \mathcal{L}(w^{*}) &\leq \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \|w_{t+1} - w^{*}\|^{2}\right).
\end{aligned}$$

$$\begin{aligned}
R_{\mathcal{L}}(T) &\leq \frac{1}{2\alpha g}\sum_{t=1}^{T}\left[\|w_{t} - w^{*}\|^{2} - \|w_{t+1} - w^{*}\|^{2}\right] \\
&= \frac{1}{2\alpha g}\left[\|w_{1} - w^{*}\|^{2} - \|w_{T+1} - w^{*}\|^{2}\right] \\
&\leq \frac{1}{2\alpha g}\|w_{1} - w^{*}\|^{2}
\end{aligned}$$

$$R_{\mathcal{L}}(T) \leq \frac{1}{2\alpha g}\|w_{1} - w^{*}\|^{2},$$


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Full text


1 Introduction

In recent years, deep neural networks have found success in various applications of artificial intelligence [LBH15, DY+14, Sch15]. They have achieved vast improvements in areas ranging from speech recognition to cancer detection. These and other benefits of deep learning can be expected to provide significant improvements to the quality of human life. While neural networks hold great promise for achieving human-level intelligence, there remain several fundamental obstacles to replicating human cognitive skills and achieving strong AI.

One of the most significant limitations is that neural networks lack the ability to continually learn: they struggle to retain previously acquired knowledge and experience after learning to perform new tasks. The ability to perform continual learning is essential for computational learning systems to achieve human-level intelligence. To realize continual learning in neural networks, we must overcome a key shortcoming known as catastrophic forgetting.

Catastrophic forgetting refers to when a neural network cannot learn tasks sequentially without “forgetting” how to perform previously learned tasks [Goo+13]. It arises as a consequence of a neural network’s inability to strike a balance between plasticity (the ability to adapt to new tasks and learn new information) and stability (the ability to preserve previously learned important information) [Rob95]. This phenomenon is known as the stability-plasticity dilemma. Ideally, neural networks should be able to generalize and apply previous knowledge to new tasks by learning representations that are applicable across a wide variety of domains. In reality, the excessive plasticity of neural networks leads to catastrophic forgetting, which prevents continual learning.

In this paper, we propose weight friction as a simple, effective, and efficient method to address this challenge. We develop an update rule for gradient descent in neural networks that allows weight values to become more resistant to change (less plasticity) as the network learns new tasks. In this way, it becomes possible for neural networks to overcome catastrophic forgetting and thereby learn in a continual fashion.

The remainder of this paper is organized as follows. In Section 2, we briefly discuss related work. In Section 3.1, we introduce the motivation and intuition behind the concept of weight friction. In Section 3.2, we analyze the rate of convergence for training neural networks with weight friction. In Section 3.3, we evaluate the effect of weight friction on several continual learning settings. In Section 4, we discuss the relevance and importance of our results. Finally, in Section 5, we review our conclusions and discuss avenues for future work.

2 Related work

Catastrophic forgetting has been a well-known problem in neural networks since the 1980s [Kir+17]. Many approaches to overcome catastrophic forgetting and facilitate transfer learning have been proposed. In general, these methods exhibit issues such as computational complexity, poor scalability, increased training time, and difficulty of implementation. One approach, called Progressive Neural Networks (PNNs), instantiates a new neural network for each task in consideration and uses lateral connections between the hidden layers of the networks to transfer knowledge [Rus+16]. The creation of a separate neural network for each task significantly increases computation time and memory use, thus making PNNs not easily scalable to a large number of tasks. Another method, Gradient Episodic Memory (GEM), uses an episodic memory for each task in consideration and avoids catastrophic forgetting by minimizing both the loss for each task and the losses on the episodic memories for previous tasks [LP+17]. While GEM yields appreciable gains in performance, it also results in a large computational burden during training. To address this problem, a method has been proposed that minimizes the average episodic memory loss rather than the episodic memory loss for each previous task [Cha+18]. This modified version of GEM is known as Averaged GEM (A-GEM) and is significantly more computationally efficient but still requires an episodic memory to be maintained. Elastic Weight Consolidation (EWC) is a method that seeks to overcome catastrophic forgetting by regularizing the loss with importance parameters based on a Laplace approximation to the Fisher information matrix calculated at the end of each task [Kir+17]. Another method applies the Benna-Fusi model to perform weight consolidation over a range of time scales [KSC18], but it requires storing the time and magnitude of every parameter update.
We offer weight friction as a method to overcome catastrophic forgetting that avoids many of the aforementioned issues while offering substantial improvements in computational efficiency.

3 Weight friction

3.1 Motivation

We propose a mechanism for avoiding catastrophic forgetting in neural networks that is analogous to a mechanism in the human brain and the concept of friction in physics. There is evidence that the human brain can prevent catastrophic forgetting by protecting knowledge in neocortical circuits [Seg10]. When learning occurs, a portion of the excitatory synapses in the brain is strengthened, and there is a resulting enlargement of dendritic spines (small protrusions covering the surface of a neuron’s dendrites that receive input from excitatory synapses) [Seg10, NSS02]. Compared to small spines, which are transient and easily erased, enlarged spines are persistent and aid in memory retention. The resistance of a spine to erasure is proportional to its volume.

In physics, a similar relationship exists between mass and frictional resistance: as the mass of an object in motion across a horizontal surface increases, the frictional resistance experienced by the object increases as well. More generally, the magnitude of the friction force is proportional to the object’s mass, as demonstrated in the following equation for friction force:

$$F_{\text{friction}} = \mu m g,$$

where $\mu$ is the coefficient of friction, $m$ is the mass of the object, and $g$ is the acceleration due to gravity.

Drawing a parallel from these principles in neurology and physics to neural networks, we propose that weights that are relatively large in magnitude should experience more resistance to change, and those that are relatively small in magnitude should experience less resistance to change. In this way, weights of larger magnitude (strong memories) are preserved, while weights of smaller magnitude (weak memories) can be overwritten.

To better understand this, let us first define the normal update as the magnitude of the update to a weight when there is no weight friction. When weight friction is present, larger-magnitude weights should have smaller-than-normal updates (more resistance to change), and smaller-magnitude weights should have close-to-normal updates (less resistance to change). We achieve this by modifying the update rule for gradient descent in neural networks. In traditional gradient descent, each weight $w$ is updated by some proportion of the gradient of the loss with respect to the weight itself. With weight friction, we include a multiplicative factor $g(w)$ that shrinks toward 0 as the magnitude of $w$ grows:

$$w = w - \alpha\, g(w)\, \frac{\partial \mathcal{L}}{\partial w},$$

where $\alpha$ is the learning rate. To model this effect, we would like $g(w)$ to resemble a Gaussian curve, with the horizontal spread of the curve scaled by a constant $\mu$ and corresponding to the threshold for a weight’s magnitude to be defined as “large.” Note that we can also choose to vary the value of $\mu$ during training.

For the purposes of this paper, we assume

$$g(w) = \frac{4 e^{\mu w}}{(1 + e^{\mu w})^{2}},$$

with the graph of this function for $\mu=1$ shown in Figure 1. However, note that $g(w)$ can be assigned to any function that behaves similarly. For example, we can also assign $g(w)=e^{-\mu w^{2}}$, whose graph for $\mu=1$ is shown in Figure 2. To better understand the effect of weight friction in relation to the magnitudes of $w$ and $g(w)$, we can look to Table 1.
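As an illustration of the two candidate friction functions and of the behavior summarized in Table 1, the following is a minimal NumPy sketch. The names `g_logistic` and `g_gaussian` and the sample weights are illustrative assumptions rather than part of the original method description. The first function is evaluated through $e^{-|\mu w|}$, which gives the same value because the expression is even in $w$, and which avoids overflow for large weights.

```python
import numpy as np

def g_logistic(w, mu=1.0):
    """Friction factor 4 e^{mu w} / (1 + e^{mu w})^2, evaluated stably."""
    # The expression is even in w, so exp(-|mu w|) yields the same value
    # without overflowing when |mu w| is large.
    z = np.exp(-np.abs(mu * np.asarray(w, dtype=float)))
    return 4.0 * z / (1.0 + z) ** 2

def g_gaussian(w, mu=1.0):
    """Alternative friction factor e^{-mu w^2}."""
    return np.exp(-mu * np.asarray(w, dtype=float) ** 2)

# Hypothetical sample weights, from small to large magnitude.
weights = np.array([0.0, 0.1, 1.0, 5.0, -5.0])
for w, a, b in zip(weights, g_logistic(weights), g_gaussian(weights)):
    print(f"w={w:+.1f}  logistic-type g={a:.4f}  gaussian g={b:.4f}")
```

Both functions equal 1 at $w=0$ and decay toward 0 as $|w|$ grows, which matches the two rows of Table 1.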

Notice that the use of weight friction does not modify the loss function itself. Additionally, since $g(w)$ is multiplicative, we can alternatively think of weight friction as a weight-based adaptive learning rate method, whereby the learning rate $\alpha$ is scaled inversely to the magnitude of the weight itself.
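The adaptive learning rate view can be made concrete with a short sketch of a single update step. This is a sketch under stated assumptions, not the author's implementation: the helper names `friction` and `friction_step`, the toy weight vector, and the constant gradient are all hypothetical and chosen only to make the effect of $g(w)$ visible.

```python
import numpy as np

def friction(w, mu):
    # g(w) = 4 e^{mu w} / (1 + e^{mu w})^2, written with exp(-|mu w|) for stability.
    z = np.exp(-np.abs(mu * w))
    return 4.0 * z / (1.0 + z) ** 2

def friction_step(w, grad, lr=0.01, mu=1.0):
    """One gradient step where each weight's learning rate is scaled by g(w)."""
    return w - lr * friction(w, mu) * grad

# Toy weights of increasing magnitude, all receiving the same gradient (assumed).
w = np.array([0.05, -0.5, 3.0, -6.0])
grad = np.ones_like(w)
w_new = friction_step(w, grad, lr=0.1, mu=1.0)
step = np.abs(w_new - w)
print(step)  # the step shrinks as |w| grows
```

With identical gradients, the smallest weight moves by nearly the full learning rate while the largest weight barely moves, which is the intended resistance to change.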

3.2 Convergence

We show that our weight friction method converges to an optimal solution for the loss function $\mathcal{L}(w)$ at a rate comparable to that of stochastic gradient descent. Let $w^{*}\in W$ denote the optimal solution to $\min\mathcal{L}(w)$ , with

$$\underset{w}{\arg\min}\, \mathcal{L}(w) = w^{*},$$

and let $\nabla_{w}\mathcal{L}(w_{t})$ denote the gradient of the loss function with respect to $w_{t}$ , the parameter vector at the current step $t$ . We define the update rule for gradient descent with weight friction to be

$$w_{t+1} = w_{t} - \alpha\, g(w_{t})\, \nabla_{w}\mathcal{L}(w_{t}), \tag{1}$$

with

$$g(w_{t}) = \frac{4 e^{\mu w_{t}}}{(1 + e^{\mu w_{t}})^{2}}$$

and $\mu$ a nonnegative value held constant for the purpose of this proof. To analyze the rate of convergence of gradient descent with weight friction, we would like to find a bound on the regret $R_{\mathcal{L}}(T)$ , which is the sum of the differences between $\mathcal{L}(w_{t})$ and $\mathcal{L}(w^{*})$ , at each time step $t\in[1,T]$ :

$$R_{\mathcal{L}}(T) = \sum_{t=1}^{T} \left(\mathcal{L}(w_{t}) - \mathcal{L}(w^{*})\right).$$

As weight friction does not alter the loss function $\mathcal{L}$ , the method is guaranteed to converge as long as $\mathcal{L}$ satisfies the following two properties [Kim+17]:

• $\mathcal{L}$ is a convex function. That is,

$$\mathcal{L}(y) \geq \mathcal{L}(x) + \langle \nabla\mathcal{L}(x), y - x \rangle \quad \forall x, y.$$

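• $\nabla\mathcal{L}$ is Lipschitz continuous. That is, there exists some constant $L$ such that for all $x$ and $y$,

$$\begin{aligned}
\|\nabla\mathcal{L}(x) - \nabla\mathcal{L}(y)\| &\leq L \|x - y\| \\
\implies \mathcal{L}(y) &\leq \mathcal{L}(x) + \langle \nabla\mathcal{L}(x), y - x \rangle + \frac{L}{2}\|y - x\|^{2}
\end{aligned} \tag{4}$$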

Without loss of generality, for our analysis of convergence, we assume these two properties hold for $\mathcal{L}$ . We now present a lemma to be used in a subsequent proof.

Lemma 1.

The expression $||w_{t}||^{2}-2\langle w_{t},w^{*}\rangle+||w^{*}||^{2}-2\alpha g\langle\nabla_{w}\mathcal{L}(w_{t}),w_{t}-w^{*}\rangle+\alpha^{2}g^{2}||\nabla_{w}\mathcal{L}(w_{t})||^{2}$ is equivalent to $||w_{t+1}-w^{*}||^{2}$ .

Proof.

We begin by separating the fourth term of the initial expression into two terms and apply the commutativity of the dot product to obtain

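$$\|w_{t}\|^{2} - 2\langle w_{t}, w^{*}\rangle + \|w^{*}\|^{2} - 2\alpha g \langle w_{t}, \nabla_{w}\mathcal{L}(w_{t})\rangle + 2\alpha g \langle \nabla_{w}\mathcal{L}(w_{t}), w^{*}\rangle + \alpha^{2} g^{2} \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}.$$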

Next, we rearrange by applying the associativity of the dot product over multiplication by the scalar $\alpha g$ , and the expression becomes:

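$$\|w_{t}\|^{2} - 2\langle w_{t}, \alpha g \nabla_{w}\mathcal{L}(w_{t})\rangle + \|\alpha g \nabla_{w}\mathcal{L}(w_{t})\|^{2} - 2\langle w_{t}, w^{*}\rangle - 2\langle -\alpha g \nabla_{w}\mathcal{L}(w_{t}), w^{*}\rangle + \|w^{*}\|^{2}.$$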

Now observe that for any two vectors $a$ and $b$ , we have

$$\|a - b\|^{2} = \|a\|^{2} - 2\langle a, b\rangle + \|b\|^{2}.$$

We can apply this to simplify the previous expression, which becomes

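$$\|w_{t} - \alpha g \nabla_{w}\mathcal{L}(w_{t})\|^{2} - 2\langle w_{t} - \alpha g \nabla_{w}\mathcal{L}(w_{t}), w^{*}\rangle + \|w^{*}\|^{2}.$$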

This is equivalent to

$$\|(w_{t} - \alpha g \nabla_{w}\mathcal{L}(w_{t})) - w^{*}\|^{2}.$$

Finally, we complete the proof by substituting $w_{t}-\alpha g\nabla_{w}\mathcal{L}(w_{t})=w_{t+1}$ , to obtain as desired:

$$\|w_{t+1} - w^{*}\|^{2}.$$

∎
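The identity in Lemma 1 can also be checked numerically. The following sketch draws random vectors and compares both sides; it treats $g$ as a single scalar, as the proof does, and the dimension, seed, and values of $\alpha$ and $g$ are arbitrary assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for w_t, w^*, and the gradient at w_t (dimension 5, assumed).
w_t, w_star, grad = rng.normal(size=(3, 5))
alpha, g = 0.1, 0.7  # arbitrary learning rate and scalar friction factor

# Left-hand expression from the statement of Lemma 1.
lhs = (w_t @ w_t - 2 * (w_t @ w_star) + w_star @ w_star
       - 2 * alpha * g * (grad @ (w_t - w_star))
       + alpha**2 * g**2 * (grad @ grad))

# Right-hand side: squared distance of the updated weights from w^*.
w_next = w_t - alpha * g * grad
rhs = np.sum((w_next - w_star) ** 2)

print(np.isclose(lhs, rhs))
```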

We now present a key theorem on the rate of convergence of training with weight friction.

Theorem 2.

If $\mathcal{L}(w)$ is convex and its gradient $\nabla_{w}\mathcal{L}(w)$ is $L$-Lipschitz continuous, then for $\alpha\in\left(0,\frac{1}{L}\right]$ and $\mu\in\mathbb{R}_{++}$, the sequence $\{w_{t}\}$ generated by Equation 1 satisfies

$$R_{\mathcal{L}}(T) = \mathcal{O}\left(\|w_{1} - w^{*}\|^{2}\right).$$

Proof.

We begin by substituting $w_{t+1}$ and $w_{t}$ for $y$ and $x$ , respectively, into Equation 4:

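$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_{t}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t+1} - w_{t}\rangle + \frac{L}{2}\|w_{t+1} - w_{t}\|^{2}$$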

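We can expand this expression by substituting $w_{t+1}-w_{t}=-\alpha g(w_{t})\nabla_{w}\mathcal{L}(w_{t})$:

$$\begin{aligned}
\mathcal{L}(w_{t+1}) &\leq \mathcal{L}(w_{t}) + \langle \nabla_{w}\mathcal{L}(w_{t}), -\alpha g(w_{t}) \nabla_{w}\mathcal{L}(w_{t})\rangle + \frac{L}{2}\|-\alpha g(w_{t}) \nabla_{w}\mathcal{L}(w_{t})\|^{2} \\
&= \mathcal{L}(w_{t}) - \alpha g(w_{t}) \|\nabla_{w}\mathcal{L}(w_{t})\|^{2} + \frac{L}{2}(\alpha g(w_{t}))^{2} \|\nabla_{w}\mathcal{L}(w_{t})\|^{2} \\
&= \mathcal{L}(w_{t}) - \alpha g(w_{t}) \left(1 - \frac{L}{2}\alpha g(w_{t})\right) \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}
\end{aligned}$$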

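Since we assume $\alpha\in\left(0,\frac{1}{L}\right]$, we know $\alpha\leq\frac{1}{L}$ and hence $L\alpha\leq 1$. Thus, we can substitute $\frac{1}{L}$ for $\alpha$ inside the parenthesized factor while still preserving the inequality:

$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_{t}) - \alpha g(w_{t}) \left(1 - \frac{1}{2} g(w_{t})\right) \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}. \tag{5}$$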

We also know that

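$$g(w_{t}) = \frac{4 e^{\mu w_{t}}}{(1 + e^{\mu w_{t}})^{2}} \leq 1 \quad \forall w_{t} \text{ and } \forall \mu > 0$$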

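and thus similarly substitute $g(w_{t})=1$ inside the parenthesized factor while preserving the inequality. Equation 5 becomes

$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_{t}) - \alpha g(w_{t}) \frac{1}{2} \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}. \tag{6}$$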

In subsequent steps of this proof we denote $g(w_{t})$ as $g$ . Now we know because $\mathcal{L}$ is convex that

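$$\mathcal{L}(w^{*}) \geq \mathcal{L}(w_{t}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w^{*} - w_{t}\rangle.$$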

Rearranging this expression gives

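$$\mathcal{L}(w_{t}) \leq \mathcal{L}(w^{*}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle,$$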

and substituting the right-hand side of this inequality for $\mathcal{L}(w_{t})$ in Equation 6 gives

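$$\begin{aligned}
\mathcal{L}(w_{t+1}) &\leq \mathcal{L}(w^{*}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle - \frac{\alpha g}{2}\|\nabla_{w}\mathcal{L}(w_{t})\|^{2} \\
&= \mathcal{L}(w^{*}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle - \frac{\alpha g}{2}\|\nabla_{w}\mathcal{L}(w_{t})\|^{2} + \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \|w_{t} - w^{*}\|^{2}\right) \\
&= \mathcal{L}(w^{*}) + \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle - \frac{\alpha g}{2}\|\nabla_{w}\mathcal{L}(w_{t})\|^{2} + \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \left(\|w_{t}\|^{2} - 2\langle w_{t}, w^{*}\rangle + \|w^{*}\|^{2}\right)\right).
\end{aligned}$$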

Expanding the three terms that follow $\mathcal{L}(w^{*})$ in the last expression and rearranging, we have

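$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w^{*}) + \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \left(\|w_{t}\|^{2} - 2\langle w_{t}, w^{*}\rangle + \|w^{*}\|^{2} - 2\alpha g \langle \nabla_{w}\mathcal{L}(w_{t}), w_{t} - w^{*}\rangle + \alpha^{2} g^{2} \|\nabla_{w}\mathcal{L}(w_{t})\|^{2}\right)\right).$$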

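We apply Lemma 1 to simplify the right-hand side of this expression, which yields

$$\begin{aligned}
\implies \mathcal{L}(w_{t+1}) &\leq \mathcal{L}(w^{*}) + \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \|w_{t+1} - w^{*}\|^{2}\right) \\
\mathcal{L}(w_{t+1}) - \mathcal{L}(w^{*}) &\leq \frac{1}{2\alpha g}\left(\|w_{t} - w^{*}\|^{2} - \|w_{t+1} - w^{*}\|^{2}\right).
\end{aligned}$$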

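Lastly, summing both sides of the preceding inequality over all $t$, we are able to derive a bound on the regret $R_{\mathcal{L}}(T)$, as desired:

$$\begin{aligned}
R_{\mathcal{L}}(T) &\leq \frac{1}{2\alpha g}\sum_{t=1}^{T}\left[\|w_{t} - w^{*}\|^{2} - \|w_{t+1} - w^{*}\|^{2}\right] \\
&= \frac{1}{2\alpha g}\left[\|w_{1} - w^{*}\|^{2} - \|w_{T+1} - w^{*}\|^{2}\right] \\
&\leq \frac{1}{2\alpha g}\|w_{1} - w^{*}\|^{2}
\end{aligned}$$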

Thus, we have

$$R_{\mathcal{L}}(T) \leq \frac{1}{2\alpha g}\|w_{1} - w^{*}\|^{2},$$

which means that training with weight friction converges at a rate on the order of $\mathcal{O}\left(||w_{1}-w^{*}||^{2}\right)$ , similar to that of training with stochastic gradient descent [Kim+17]. ∎
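The per-step descent inequality used in the proof, $\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_{t}) - \frac{\alpha g(w_{t})}{2}\|\nabla_{w}\mathcal{L}(w_{t})\|^{2}$, can be checked numerically on a simple convex problem. The sketch below uses a one-dimensional quadratic loss whose gradient is $L$-Lipschitz; the loss, the constants, and the starting point are illustrative assumptions and are not taken from the experiments in this paper.

```python
import numpy as np

L_const = 4.0   # Lipschitz constant of the gradient (assumed)
c = 1.5         # minimizer of the toy loss (assumed)

def loss(w):
    return 0.5 * L_const * (w - c) ** 2

def grad(w):
    return L_const * (w - c)

def friction(w, mu):
    z = np.exp(-abs(mu * w))
    return 4.0 * z / (1.0 + z) ** 2

alpha, mu = 1.0 / L_const, 1.0   # alpha at the upper end of (0, 1/L]
w = -3.0
losses = [loss(w)]
descent_holds = True
for _ in range(50):
    g, d = friction(w, mu), grad(w)
    w_next = w - alpha * g * d
    # Right-hand side of the descent inequality for this step.
    bound = loss(w) - 0.5 * alpha * g * d ** 2
    descent_holds = descent_holds and (loss(w_next) <= bound + 1e-9)
    w = w_next
    losses.append(loss(w))

print(descent_holds, losses[0], losses[-1])
```

The check only covers this toy loss. It illustrates the inequality and does not replace the proof.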

3.3 Empirical analysis

We defined three experimental settings to analyze the effects of weight friction.

Settings 1 and 2 were based on the image classification benchmark datasets MNIST and Fashion-MNIST [LCB98, XRV17]. MNIST consists of 70,000 grayscale images sized 28x28, each showing a single, centered handwritten digit from one of 10 classes (0-9). Fashion-MNIST consists of 70,000 grayscale images sized 28x28, each showing a single, centered article of clothing from one of 10 classes. For each dataset, we used the provided 10,000-example test set and randomly sampled the remaining examples according to an 80/20 train/validation split.

In Setting 1, a neural network was first trained on the MNIST dataset to perform the task of handwritten digit classification. Then, the same neural network continued to train on the Fashion-MNIST dataset to perform the task of object classification. Lastly, with no further training, we re-evaluated the model’s performance on the first task on MNIST.

Setting 2 was identical to Setting 1, but with models trained initially on Fashion-MNIST, then trained on MNIST second, and finally re-evaluated on Fashion-MNIST.
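The sequential protocol of Settings 1 and 2 (train on a first task without friction, continue on a second task with or without friction, then re-evaluate on the first task) can be sketched on a toy problem. The following uses two synthetic linear regression tasks in place of MNIST and Fashion-MNIST. The data, the model, the hyperparameters, and the helper names are all assumptions made for illustration, and the printed numbers say nothing about the results reported in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n=200, d=8):
    """Synthetic regression task standing in for a real dataset."""
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    return X, X @ w_true

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def friction(w, mu):
    z = np.exp(-np.abs(mu * w))
    return 4.0 * z / (1.0 + z) ** 2

def train(w, X, y, lr=0.05, steps=300, mu=None):
    """Full-batch gradient descent; mu=None disables weight friction."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        scale = 1.0 if mu is None else friction(w, mu)
        w = w - lr * scale * grad
    return w

task_a, task_b = make_task(), make_task()
w0 = np.zeros(8)
w_a = train(w0, *task_a)               # first task: no friction
w_plain = train(w_a, *task_b)          # second task: baseline
w_fric = train(w_a, *task_b, mu=2.0)   # second task: with friction

for name, w in [("baseline", w_plain), ("friction", w_fric)]:
    print(name, "task A:", mse(w, *task_a), "task B:", mse(w, *task_b))
```

Swapping the order of `task_a` and `task_b` gives the analogue of Setting 2.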

Setting 3 was based on Permuted MNIST [Goo+13], a variation of MNIST. With Permuted MNIST, new tasks of comparable difficulty to the original MNIST classification task are created by permuting the pixels of every image based on a randomly generated permutation. Similar to Settings 1 and 2, in Setting 3, a neural network was trained successively on ten Permuted MNIST tasks, and the model’s performance on previous and current tasks was re-evaluated after training on each new task. Training and test sets were derived from the original 60,000- and 10,000-example training and test sets. No validation data was used in this setting; cross-validation was used for parameter tuning.
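The construction of Permuted MNIST tasks can be sketched as follows: each task applies one fixed random pixel permutation to every flattened image. This is a minimal sketch under assumptions. The function name `make_permuted_tasks` and the random stand-in images are hypothetical, and every task here (including the first) is permuted, which may differ from the exact setup used in the experiments.

```python
import numpy as np

def make_permuted_tasks(images, num_tasks, seed=0):
    """Return one pixel-permuted copy of `images` per task.

    images: array of shape (n, 28, 28); each task uses a single fixed
    permutation shared by all images in that task.
    """
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)      # (n, 784)
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(flat.shape[1])   # fixed permutation for this task
        tasks.append(flat[:, perm])
    return tasks

# Random stand-in for MNIST images (the real dataset is not loaded here).
fake_images = np.random.default_rng(1).integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
tasks = make_permuted_tasks(fake_images, num_tasks=10)
print(len(tasks), tasks[0].shape)
```

Because each permutation only reorders pixels, every task contains the same pixel values per image, which is why the tasks are of comparable difficulty for a fully connected network.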

In Settings 1-2, model architectures were feedforward neural networks consisting of a 784-neuron input layer followed by three 256-neuron hidden layers with ReLU activation and one 10-neuron output layer with softmax activation. No dropout or regularization was used, and the learning rate was 0.01. Each model was trained until convergence at 50 epochs on the first task and 100 epochs on the second task. For both settings, we trained models both with and without weight friction. Models trained without weight friction used Adam optimization and served as a baseline for comparison.

In Setting 3, model architectures were feedforward neural networks consisting of a 784-neuron input layer, two 256-neuron hidden layers with ReLU activation, and one 10-neuron output layer with softmax activation, with no dropout and no regularization and trained for 5,000 epochs per task. We compared models trained with weight friction (WF) to several models trained with other continual learning methods (EWC, PNN, A-GEM) and to a baseline model trained without weight friction and with Adam optimization (VAN). For VAN, EWC, PNN, and A-GEM, all hyperparameters were identical to those used by [Cha+18] except learning rate, which was tuned to 0.001.

In all settings, models trained with weight friction were optimized over $\mu$ by gridsearch. For all models, weight friction was applied only to tasks following the first; no weight friction was applied to training on the first task in any setting. We used test set accuracy as a metric to evaluate model performance and averaged the results of 10 random initializations to account for variability. All model parameters were initialized with Xavier initialization [GB10].
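For reference, the architecture used in Settings 1-2 (784-256-256-256-10 with ReLU hidden layers and a softmax output) can be sketched in NumPy. The uniform variant of Xavier initialization for the weight matrices and the zero-initialized biases are assumptions of this sketch, as are the helper names; the text above does not specify these details.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Xavier (Glorot) uniform initialization; the uniform variant is assumed.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def init_mlp(sizes, seed=0):
    rng = np.random.default_rng(seed)
    # Biases start at zero (an assumption of this sketch).
    return [(xavier_uniform(m, n, rng), np.zeros(n)) for m, n in zip(sizes, sizes[1:])]

def forward(params, x):
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)              # ReLU hidden layers
    W, b = params[-1]
    logits = x @ W + b
    logits -= logits.max(axis=1, keepdims=True)     # stabilize the softmax
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

params = init_mlp([784, 256, 256, 256, 10])
probs = forward(params, np.zeros((2, 784)))
n_params = sum(W.size + b.size for W, b in params)
print(probs.shape, n_params)
```

Dropping one hidden layer from the size list gives the two-hidden-layer network described for Setting 3.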

As shown in Figures 3 and 4, the baseline models trained without weight friction (blue bars) exhibited catastrophic forgetting. Average test accuracy for the first task decreased from 98.08% to 26.09% in Setting 1 and from 89.00% to 13.89% in Setting 2 following training on the second task. In comparison, the models trained with weight friction (orange bars) were better able to learn and remember representations across task domains. Specifically, the average final test accuracy on the first task rose from 26.09% to 83.82% in Setting 1 and from 13.89% to 49.81% in Setting 2. At the same time, the average final test accuracy achieved with weight friction for the second task remained nearly the same as the baseline in both settings. This is evidence that weight friction enables neural networks to overcome catastrophic forgetting and learn representations that facilitate continual learning.

Differences in the results of Settings 1 and 2 further suggest that the order in which tasks are learned impacts the efficacy of weight friction. Specifically, in both settings, the model ultimately learned to classify images from both MNIST and Fashion-MNIST. Yet training on MNIST before Fashion-MNIST (Setting 1) led to better results overall, with the model achieving average accuracies of 85.29% and 83.82% on Fashion-MNIST and MNIST, respectively, in Setting 1 versus the 49.81% and 94.95% achieved in Setting 2. It is possible that weight friction allows neural networks to learn representations on simpler tasks that are transferred when learning more complex tasks. This potentially reflects the progression of human learning from simple to complex tasks.

The performance results of Setting 3 show that weight friction is comparable to existing approaches. From Figure 5, it is evident that weight friction is not limited to operating over a small number of task domains and can be applied to a large number of tasks without compromising model accuracy. In terms of average model accuracy, weight friction outperforms EWC as more tasks are learned. Figure 6 indicates that weight friction results in a much lower computation time and memory cost relative to EWC, PNNs, and A-GEM. It is important to note that training a neural network with weight friction resulted in a negligible increase in memory use from vanilla training without weight friction. Furthermore, in terms of computation time during training, weight friction was 2.16 times faster than A-GEM, 1.98 times faster than PNNs, and 1.29 times faster than EWC. Weight friction yielded even greater improvements in terms of memory cost during training, with memory use 35.71 times lower than PNNs, 3.57 times lower than A-GEM, and 3.04 times lower than EWC. While PNNs and A-GEM achieved the highest accuracy, they also resulted in the worst efficiency, with the highest memory and computation time costs, respectively. In comparison, weight friction achieved the greatest efficiency and relatively high accuracy, which indicates that weight friction offers one of the best tradeoffs between accuracy and efficiency.

4 Discussion

The implications of our results are significant in comparison to those of existing approaches. In particular, weight friction does not suffer from common limitations of current methods for facilitating continual learning. For instance, unlike Progressive Neural Networks, weight friction does not significantly increase computational complexity or the number of model parameters that must be learned, as it does not require the instantiation of a new neural network for each task of interest [Rus+16]. Unlike GEM and A-GEM, it does not require the use of an episodic memory during training [LP+17, Cha+18]. Unlike importance factor-based methods, it does not require prior knowledge or supervision regarding which weights should be preserved, which simplifies training. Furthermore, unlike methods that seek to maximize domain confusion, weight friction is not restricted by design to operate over a limited set of task domains [Tze+14].

The benefits of weight friction are potentially magnified when applied to other types of neural networks. In particular, as the weight friction mechanism depends only on a modification to the update rule for gradient descent, it is inherently applicable to any neural network architecture. This therefore allows us to harness weight friction to overcome catastrophic forgetting and facilitate continual learning in other types of neural networks.

5 Conclusions

In this research, we addressed the problem of catastrophic forgetting in neural networks, which hinders continual learning. Specifically, we drew from principles in neurology and physics to develop the concept of weight friction as a new learning method. We showed that training with weight friction converges at a rate comparable to that of stochastic gradient descent. We showed empirically that neural networks trained with weight friction can potentially learn representations shared across data domains. We further discussed the various benefits of weight friction as an approach that is less complex, more efficient, and more widely applicable in comparison to existing methods. Ultimately, this research takes a step toward building neural networks with continuous learning ability.

There are several areas we would like to pursue in our future work. First, we seek to derive a bound on the information capacity of a neural network trained with weight friction. Namely, how many tasks and of what complexity can such a neural network learn while maintaining a reasonable level of performance for each task? Another avenue for investigation is the application of weight friction to other neural network architectures, including convolutional and recurrent neural networks. We also hope to study the effect of weight friction function $g(w)$ selection on performance. Lastly, we would like to analyze the effect of learning algorithms that combine weight friction, momentum, regularization, and/or normalization.

Bibliography

[Cha+18] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach and Mohamed Elhoseiny “Efficient Lifelong Learning with A-GEM” In arXiv preprint arXiv:1812.00420, 2018
[DY+14] Li Deng and Dong Yu “Deep Learning: Methods and Applications” In Foundations and Trends® in Signal Processing 7.3–4 Now Publishers, Inc., 2014, pp. 197–387
[GB10] Xavier Glorot and Yoshua Bengio “Understanding the Difficulty of Training Deep Feedforward Neural Networks” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256
[Goo+13] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville and Yoshua Bengio “An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks” In arXiv preprint arXiv:1312.6211, 2013
[Kim+17] Hyoung Seok Kim, Ji Hoon Kang, Woo Myoung Park, Suk Hyun Ko, Yoon Ho Cho, Dae Sung Yu, Young Sook Song and Jung Won Choi “Convergence Analysis of Optimization Algorithms” In arXiv preprint arXiv:1707.01647, 2017
[Kir+17] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho and Agnieszka Grabska-Barwinska “Overcoming Catastrophic Forgetting in Neural Networks” In Proceedings of the National Academy of Sciences 114.13 National Academy of Sciences, 2017, pp. 3521–3526
[KSC18] Christos Kaplanis, Murray Shanahan and Claudia Clopath “Continual Reinforcement Learning with Complex Synapses” In International Conference on Machine Learning, 2018, pp. 2502–2511
[LBH15] Yann LeCun, Yoshua Bengio and Geoffrey Hinton “Deep Learning” In Nature 521.7553 Nature Publishing Group, 2015, pp. 436
