Elephant Neural Networks: Born to Be a Continual Learner
Qingfeng Lan, A. Rupam Mahmood

TL;DR
This paper introduces elephant activation functions that promote sparse representations and gradients, significantly enhancing neural networks' ability to retain knowledge in continual learning scenarios without additional memory or pre-training.
Contribution
The study identifies the importance of gradient sparsity in reducing catastrophic forgetting and proposes a new class of activation functions that improve continual learning performance.
Findings
Elephant activation functions generate sparse representations and gradients.
Replacing classical activations with elephant functions improves resilience to forgetting.
Achieves strong results on Split MNIST without replay or task boundaries.
Abstract
Catastrophic forgetting has remained a significant challenge to continual learning for decades. While recent works have proposed effective methods to mitigate this problem, they mainly focus on the algorithmic side. Meanwhile, we do not fully understand what architectural properties of neural networks lead to catastrophic forgetting. This study aims to fill this gap by studying the role of activation functions in the training dynamics of neural networks and their impact on catastrophic forgetting. Our study reveals that, besides sparse representations, the gradient sparsity of activation functions also plays an important role in reducing forgetting. Based on this insight, we propose a new class of activation functions, elephant activation functions, that can generate both sparse representations and sparse gradients. We show that by simply replacing classical activation functions with…
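To make the idea of "sparse representations and sparse gradients" concrete, here is a minimal sketch. This summary does not state the functional form of the elephant activation, so the bell-shaped form 1 / (1 + |x/a|^d) used below is an assumption, as are the parameter names `a` (width) and `d` (slope), the tolerance `tol`, and the helper `sparsity`. It only illustrates the general point that a function that is flat away from a narrow window has near-zero output and near-zero gradient for most inputs, whereas ReLU keeps a gradient of 1 for every positive input.

```python
import math

def elephant(x, a=1.0, d=4.0):
    """Assumed bell-shaped activation: 1 / (1 + |x/a|^d)."""
    return 1.0 / (1.0 + abs(x / a) ** d)

def elephant_grad(x, a=1.0, d=4.0):
    """Analytic derivative of elephant() with respect to x."""
    if x == 0.0:
        return 0.0  # the peak of the bell is flat for d > 1
    u = abs(x / a)
    return -math.copysign(1.0, x) * (d / a) * u ** (d - 1) / (1.0 + u ** d) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def sparsity(values, tol=1e-2):
    """Fraction of entries whose magnitude is below tol (hypothetical metric)."""
    return sum(abs(v) < tol for v in values) / len(values)

# Evenly spaced pre-activations in [-10, 10], a stand-in for a layer's inputs.
inputs = [i * 0.5 for i in range(-20, 21)]

elephant_out_sparsity = sparsity([elephant(x) for x in inputs])
elephant_grad_sparsity = sparsity([elephant_grad(x) for x in inputs])
relu_out_sparsity = sparsity([relu(x) for x in inputs])
relu_grad_sparsity = sparsity([relu_grad(x) for x in inputs])

print("elephant: output sparsity", round(elephant_out_sparsity, 3),
      "gradient sparsity", round(elephant_grad_sparsity, 3))
print("relu:     output sparsity", round(relu_out_sparsity, 3),
      "gradient sparsity", round(relu_grad_sparsity, 3))
```

Under these assumptions, a smaller `a` narrows the window of inputs that produce a non-negligible output or gradient, and a larger `d` sharpens its edges. The grid of inputs and the tolerance are arbitrary choices for illustration, not the paper's experimental setup.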
Peer Reviews
Decision · Submitted to ICLR 2024
1. Introduces a sparse activation function
1. The experiment setup and text are slightly misleading: theoretically, an activation function alone cannot resolve forgetting; activation patterns and weights should be parameterized to avoid neural cross-talk across tasks. The paper would benefit from showing results on deeper architectures and SoTA continual learning approaches. 2. Writing should be improved; words such as "excellent" should be avoided given that results are almost 13% lower than SoTA. 3. Ablation study missing. 4. Com
This paper studies the interaction between neural network architectures and continual learning, which is under-studied. It focuses on the activation function, a critical component of neural networks. It is a good problem, and sparsity has been called out in some of the earliest continual learning papers of the deep learning era as a good inductive bias for continual learning, e.g., Kemker et al. (AAAI 2018). The paper presents theoretical analysis to demonstrate that sparse representations alone can
In non-rehearsal methods, elephant neural networks (ENNs) demonstrate efficacy. However, it is unknown how ENNs generalize to other performant algorithms, e.g., rehearsal-based methods. It is unclear whether ENNs’ efficacy is specific to the particular non-rehearsal settings (Sec. 5.2) or not! Besides MLPs and CNNs, the efficacy of elephant activations with ViTs is unexplored, and testing it is necessary to verify generality. There is a lack of experiments on high-dimensional and large-scale datasets, e.g., ImageNet-1K. Also it
The ability to control generalisation vs. plasticity through a hyper-parameter of an activation function is a great feature for continual learning, though it comes at the cost of more hyper-parameters. The sparsity analysis of activation functions is interesting. The proposed method does not require task boundary information, which is highly desirable for practical continual learning. The empirical evaluation is honest: it shows that the proposed method does not beat (on accuracy) other baselines, though those
Switching to a symmetric activation function will have a significant impact on the internal representation (and perhaps the generalisation capabilities) of a neural network. After all, there must be a reason why we typically use asymmetric activation functions like ReLU, sigmoid, and tanh. I think any proposal of a radically new (and here I mean the symmetry aspect) activation function calls for additional evaluations of the utility of the proposed activation function under normal (all data available) cond
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
