Improving information retention in large scale online continual learning
Zhipeng Cai, Vladlen Koltun, Ozan Sener

TL;DR
This paper investigates the challenge of information retention in large-scale online continual learning, revealing limitations of naive SGD and proposing an adaptive moving average optimizer with a new learning rate schedule to improve performance.
Contribution
It introduces an adaptive moving average optimizer and a novel learning rate schedule specifically designed to enhance information retention in large-scale online continual learning.
Findings
The adaptive moving average optimizer (AMA) combined with the proposed learning rate schedule (MALR) improves retention on benchmarks
Naive SGD fails to retain information long-term
Proposed methods outperform existing approaches
Abstract
Given a stream of data sampled from non-stationary distributions, online continual learning (OCL) aims to adapt efficiently to new data while retaining existing knowledge. The typical approach to information retention (the ability to retain previous knowledge) is to keep a replay buffer of fixed size and compute gradients using a mixture of new data and the replay buffer. Surprisingly, recent work (Cai et al., 2021) suggests that information retention remains a problem in large-scale OCL even when the replay buffer is unlimited, i.e., when the gradients are computed using all past data. This paper focuses on this peculiarity to understand and address information retention. To pinpoint the source of this problem, we theoretically show that, given limited computation budgets at each time step, even without a strict storage limit, naively applying SGD with constant or constantly…
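The replay mechanism the abstract describes (gradients computed on a mixture of new data and a fixed-size replay buffer) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names ReplayBuffer and mixed_batch are hypothetical, and reservoir sampling is an assumed eviction policy that the abstract does not specify.

```python
import random


class ReplayBuffer:
    """Fixed-size buffer. Eviction uses reservoir sampling, which is an
    assumption made for this sketch and not taken from the paper."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered to the buffer
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep each seen example with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        # Draw up to k stored examples without replacement.
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)


def mixed_batch(new_examples, buffer, replay_size):
    """Build one training batch: the new examples followed by replayed
    ones, then store the new examples for future replay."""
    batch = list(new_examples) + buffer.sample(replay_size)
    for example in new_examples:
        buffer.add(example)
    return batch
```

In a training loop, each batch returned by mixed_batch would be passed to a single gradient step. The ratio of new to replayed examples (here controlled by replay_size) is a design choice that this sketch leaves to the caller.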
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
Methods: Stochastic Gradient Descent
