Forgetting is Everywhere

Ben Sanati; Thomas L. Lee; Trevor McInroe; Aidan Scannell; Nikolay Malkin; David Abel; Amos Storkey

arXiv:2511.04666·cs.LG·February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a unified theory of forgetting in learning algorithms, characterizing it as a lack of self-consistency in predictive distributions, and demonstrates how Bayesian inference can mitigate forgetting across various learning tasks.

Contribution

It provides a novel, task- and algorithm-agnostic framework for understanding forgetting, linking it to predictive information loss and showing Bayesian inference as a solution.

Findings

01

Forgetting occurs across all deep learning settings.

02

Bayesian inference can prevent forgetting.

03

Forgetting impacts learning efficiency.

Abstract

A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner's predictive distribution, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm's propensity to forget and demonstrates that exact Bayesian inference allows for adaptation without forgetting. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate…
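The claim that exact Bayesian inference adapts without forgetting can be illustrated with a small worked check. The sketch below is not the paper's definition or measure. It assumes one simple reading of self-consistency: a learner's current prediction should equal the average of its updated predictions when the update data is drawn from its own current prediction. The Beta-Bernoulli model, the `last_value_predictive` learner, and all function names are hypothetical choices made for illustration.

```python
from fractions import Fraction

# Assumption: "self-consistency" is read here as a one-step check, namely
# current predictive == expected predictive after updating on a sample
# drawn from the current predictive. This is an illustration only.

def beta_bernoulli_predictive(alpha, beta):
    """Posterior predictive P(y=1) under a Beta(alpha, beta) belief."""
    return Fraction(alpha, alpha + beta)

def expected_bayes_predictive_after_self_update(alpha, beta):
    """Average the updated predictive over y drawn from the current one."""
    p_one = beta_bernoulli_predictive(alpha, beta)
    after_one = beta_bernoulli_predictive(alpha + 1, beta)   # observed y=1
    after_zero = beta_bernoulli_predictive(alpha, beta + 1)  # observed y=0
    return p_one * after_one + (1 - p_one) * after_zero

def last_value_predictive(last_y):
    """Hypothetical non-Bayesian learner: trusts only the last observation."""
    return Fraction(9, 10) if last_y == 1 else Fraction(1, 10)

def expected_last_value_predictive_after_self_update(last_y):
    """Same averaging, applied to the last-observation learner."""
    p_one = last_value_predictive(last_y)
    return (p_one * last_value_predictive(1)
            + (1 - p_one) * last_value_predictive(0))

bayes_now = beta_bernoulli_predictive(3, 2)
bayes_next = expected_bayes_predictive_after_self_update(3, 2)
naive_now = last_value_predictive(1)
naive_next = expected_last_value_predictive_after_self_update(1)

print("Bayes:", bayes_now, "->", bayes_next)   # unchanged on average
print("Naive:", naive_now, "->", naive_next)   # drifts under its own updates
```

Under this reading, the Bayesian learner's prediction is unchanged on average by updates on its own simulated data, while the last-observation learner's prediction drifts even though no new information arrived.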

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper is theoretically innovative, proposing a task-agnostic and algorithm-agnostic definition of forgetting that could serve as a foundational framework for theoretical analysis in the community.
2. The mathematical derivations are rigorous and the concepts are clearly defined.
3. The experiments cover multiple learning paradigms.
4. The finding that "moderate forgetting leads to optimal learning efficiency" is insightful.

Weaknesses

1. First, the "powerful insight" stated in the introduction is, while correct, rather self-evident and perhaps too simple to be considered an insight. In deep learning, it is well known that parameter updates via back-propagation naturally change the model's internal representations and thereby its predictions on data, which in turn causes forgetting. This seems more like a direct consequence of parameter modification than a novel conceptual observation.
2. The paper presents basic y

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper is generally well written and organized (though a few statements may need improvement).
- The work introduces a novel conceptual angle by defining forgetting based on induced futures rather than accuracy degradation.

Weaknesses

- The statement in Lines 45–47 appears too strong, as do several other statements in the paper that rely on the same assumption. The claim assumes that the learner does not gain any new information. However:
  - How can we ensure the learner indeed learns nothing new during such updates, especially under stochasticity?
  - If the data distribution changes (e.g., under distribution shift or class-incremental settings), changes in induced futures may come from newly acquired knowledge, not forgetting. This

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. It is interesting to study the forgetting issue as a general property of machine learning models. The idea is novel.
2. It is novel to measure forgetfulness as the propensity to forget, i.e., how much the learner system's internal representation of the future drifts purely due to its own updates, independent of environmental changes.
3. The paper conducts experiments spanning multiple machine learning settings, such as regression, classification, and reinforcement learning, leading to som

Weaknesses

1. The paper lacks the theoretical grounding to justify the notion of predictive consistency.
2. It is unclear what the direct benefits of understanding the forgetfulness of a learner's system are, especially how it can contribute to downstream tasks. It would be more interesting to investigate how this understanding can help mitigate forgetting.
3. The paper focuses on the analysis of a system's self-consistency in predictive distributions, treating the system as a black box and off

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications