Continual Learning in Deep Networks: an Analysis of the Last Layer
Timothée Lesort, Thomas George, Irina Rish

TL;DR
This paper investigates how different output layer parameterizations in deep neural networks influence learning and forgetting in continual learning, proposing solutions that improve performance depending on data distribution changes.
Contribution
It provides a detailed analysis of how the output layer contributes to catastrophic forgetting and evaluates output layer parameterizations that improve continual learning without introducing a dedicated continual-learning algorithm.
Findings
Changing output layer parameterization can mitigate forgetting.
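The best-performing type of output layer depends on the data distribution drift and the amount of data available.
In some cases where a standard linear layer fails, changing the output layer parameterization alone yields significantly better performance.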
Abstract
We study how different output layer parameterizations of a deep neural network affect learning and forgetting in continual learning settings. The following three effects can cause catastrophic forgetting in the output layer: (1) weight modifications, (2) interference, and (3) projection drift. In this paper, our goal is to provide more insight into how changing the output layer parameterization may address (1) and (2). Some potential solutions to those issues are proposed and evaluated here in several continual learning scenarios. We show that the best-performing type of output layer depends on the data distribution drifts and/or the amount of data available. In particular, in some cases where a standard linear layer would fail, changing parameterization is sufficient to achieve a significantly better performance, without introducing any continual-learning algorithm but instead by…
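As a rough illustration of how an output layer parameterization can change interference between classes, the sketch below compares a standard linear layer with a cosine-similarity layer on a toy example. The cosine layer is chosen here as an assumption for illustration; the text above does not name the specific parameterizations evaluated. All names (`linear_logits`, `cosine_logits`, the toy weights and features) are hypothetical.

```python
import math


def linear_logits(weights, biases, features):
    """Standard linear output layer: one dot product plus bias per class."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(w, features)) + b
        for w, b in zip(weights, biases)
    ]


def cosine_logits(weights, features, eps=1e-12):
    """Cosine-similarity layer: logits depend only on vector directions."""
    x_norm = math.sqrt(sum(x_i * x_i for x_i in features)) + eps
    logits = []
    for w in weights:
        w_norm = math.sqrt(sum(w_i * w_i for w_i in w)) + eps
        dot = sum(w_i * x_i for w_i, x_i in zip(w, features))
        logits.append(dot / (w_norm * x_norm))
    return logits


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical toy setup (an assumption, not data from the paper):
# class 0 was learned in an earlier task, class 1 in a later task
# that left its weight vector with a much larger norm.
weights = [[1.0, 0.0], [3.0, 3.0]]
biases = [0.0, 0.0]
features = [1.0, 0.2]  # feature vector pointing mostly along class 0

linear_pred = argmax(linear_logits(weights, biases, features))
cosine_pred = argmax(cosine_logits(weights, features))
print("linear layer predicts class", linear_pred)
print("cosine layer predicts class", cosine_pred)
```

In this toy case the linear layer picks the later class because its weight vector has a larger norm, while the cosine layer compares directions only and picks the earlier class. This is meant only to show the kind of effect a reparameterization can have, not to reproduce the paper's experiments.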
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Machine Learning and ELM
Methods: Linear Layer · Stochastic Gradient Descent
