Attention Retention for Continual Learning with Vision Transformers

Yue Lu; Xiangyu Zhou; Shizhou Zhang; Yinghui Xing; Guoqiang Liang; Wencong Zhang

arXiv:2602.05454·cs.CV·February 6, 2026

TL;DR

Inspired by neuroscientific insights into selective attention, this paper introduces an attention-retaining framework for Vision Transformers that mitigates catastrophic forgetting in continual learning by explicitly constraining attention drift through gradient masking.

Contribution

It proposes a new method that preserves learned visual concepts in Vision Transformers during continual learning by controlling attention drift with a gradient masking technique.

Findings

01 Achieves state-of-the-art performance in continual learning tasks.

02 Effectively mitigates catastrophic forgetting across diverse scenarios.

03 Preserves attention to previously learned visual concepts.

Abstract

Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients…
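The two steps named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes the standard attention rollout formulation (head-averaged attention mixed with the identity and multiplied across layers) as a stand-in for the paper's layer-wise rollout, a hypothetical per-instance mean threshold for the binary mask, and a hypothetical rule that zeros a gradient wherever the mask is 1. The abstract is truncated before it says which gradients are masked, so that last part in particular is an assumption, and all function names are illustrative.

```python
# Illustrative sketch only; names and rules below are assumptions,
# not the method as specified in the paper.

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layer_attentions):
    """Standard attention rollout over head-averaged attention matrices.

    Each layer's matrix is mixed with the identity (to account for the
    residual connection), row-normalised, and multiplied into the running
    product, earliest layer first.
    """
    n = len(layer_attentions[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_attentions:
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        mixed = [[v / sum(row) for v in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout


def binary_mask(scores):
    """Assumed rule: mark positions scoring above this instance's own mean."""
    threshold = sum(scores) / len(scores)
    return [1 if s > threshold else 0 for s in scores]


def mask_gradients(grads, mask):
    """Assumed rule: zero the gradient wherever the mask is 1."""
    return [0.0 if m == 1 else g for g, m in zip(grads, mask)]


# Toy example: 3 tokens (token 0 plays the class token), 2 layers.
layers = [
    [[0.2, 0.7, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]],
    [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
]
rollout = attention_rollout(layers)
patch_scores = rollout[0][1:]          # class-token attention to the patches
mask = binary_mask(patch_scores)       # instance-adaptive: threshold per input
masked = mask_gradients([0.5, -0.3], mask)
```

Because the threshold is computed from each input's own scores, the mask adapts per instance rather than using a fixed global cutoff. In a real training loop the masking step would be applied to actual gradient tensors during backpropagation; the list-based version here only shows the shape of the idea.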

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications