Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability
Sergi Masip, Gido M. van de Ven, Javier Ferrando, Tinne Tuytelaars

TL;DR
This paper introduces a mechanistic, feature-centric framework for understanding catastrophic forgetting in continual learning, focusing on how transformations of feature encodings affect the capacity allocated to each feature and its readout by downstream computations.
Contribution
It offers a geometric interpretation of forgetting, formalizes it through analysis of a tractable toy model, and demonstrates practical use by applying Crosscoders to Vision Transformers.
Findings
Transformations to a feature's encoding can cause forgetting by reducing the capacity allocated to that feature or by disrupting its readout by downstream computations.
Depth in models exacerbates catastrophic forgetting.
The framework extends to practical models: Crosscoders are used to analyze Vision Transformers trained on CIFAR-10.
Abstract
Catastrophic forgetting in continual learning is often measured at the performance or last-layer representation level, overlooking the underlying mechanisms. We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features or by disrupting their readout by downstream computations. Analysis of a tractable toy model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders. We do so through a case study example of a Vision…
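To make "allocated capacity" concrete, here is a minimal sketch. It assumes the capacity measure from the superposition literature, where feature i with encoding vector w_i has capacity (w_i · w_i)^2 / Σ_j (w_i · w_j)^2; the paper's own definition may differ. The two-feature setup, the vectors, and the function names are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_capacity(vectors, i):
    """Assumed capacity of feature i: (w_i . w_i)^2 / sum_j (w_i . w_j)^2.

    Equals 1 when w_i is orthogonal to every other feature vector and
    shrinks as w_i overlaps with other features.
    """
    w_i = vectors[i]
    denom = sum(dot(w_i, w_j) ** 2 for w_j in vectors)
    return dot(w_i, w_i) ** 2 / denom if denom > 0 else 0.0


# Hypothetical encodings before a new task: two orthogonal features.
before = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical encodings after a new task: feature 0 has been rotated
# onto the direction already used by feature 1.
after = [[0.0, 1.0], [0.0, 1.0]]

cap_before = [feature_capacity(before, i) for i in range(2)]
cap_after = [feature_capacity(after, i) for i in range(2)]

print("capacity before:", cap_before)
print("capacity after: ", cap_after)
```

In this toy case each feature starts with its own dimension, and the rotation forces the two features to share one, so the capacity of each drops even though neither vector changed length. This is one geometric way an encoding transformation could reduce capacity; it does not cover the readout-disruption route the abstract also mentions.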
