Remembering Transformer for Continual Learning
Yuwei Sun, Ippei Fujisawa, Arthur Juliani, Jun Sakuma, Ryota Kanai

TL;DR
The paper introduces the Remembering Transformer, a continual learning model inspired by the brain's Complementary Learning Systems. It reduces catastrophic forgetting by combining a mixture-of-adapters architecture with generative model-based novelty detection, and reports state-of-the-art results on class-incremental split and permutation tasks.
Contribution
It proposes a memory-efficient Transformer-based architecture with dynamic routing and novelty detection to mitigate catastrophic forgetting in continual learning.
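To make the memory-efficiency claim concrete, here is a minimal sketch of one way an adapter can sit on top of a frozen pretrained layer. It assumes a low-rank adapter; this summary does not state the adapter parameterization, so the low-rank form, the class names `LowRankAdapter` and `AdaptedLinear`, and all sizes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


class LowRankAdapter:
    """Hypothetical low-rank adapter (an assumption, not the paper's code).

    It contributes the update B @ A on top of a frozen weight matrix,
    so it stores rank * (d_in + d_out) values instead of d_in * d_out.
    """

    def __init__(self, d_in, d_out, rank, rng):
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        # Zero initialization makes a fresh adapter a no-op on the base layer.
        self.B = np.zeros((d_out, rank))

    def num_params(self):
        return self.A.size + self.B.size

    def delta(self, x):
        # x has shape (batch, d_in); the result has shape (batch, d_out).
        return x @ self.A.T @ self.B.T


class AdaptedLinear:
    """A frozen linear layer plus one small adapter per task."""

    def __init__(self, d_in, d_out, rng):
        self.d_in, self.d_out = d_in, d_out
        # Stand-in for pretrained weights; never updated in this sketch.
        self.W = rng.normal(0.0, 0.02, size=(d_out, d_in))
        self.adapters = []

    def add_adapter(self, rank, rng):
        self.adapters.append(LowRankAdapter(self.d_in, self.d_out, rank, rng))
        return len(self.adapters) - 1  # index of the new adapter

    def forward(self, x, adapter_id):
        # Only the selected adapter contributes, which keeps tasks separate.
        return x @ self.W.T + self.adapters[adapter_id].delta(x)
```

With `d_in = d_out = 64` and `rank = 4`, each adapter holds 512 values against 4,096 in the frozen layer, which illustrates why adding one adapter per task can stay cheap.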
Findings
Achieved 15.90% higher accuracy than previous methods on split tasks.
Reduced model size from 11.18M to 0.22M parameters.
Demonstrated effectiveness across diverse class-incremental and permutation tasks.
Abstract
Neural networks encounter the challenge of Catastrophic Forgetting (CF) in continual learning, where new task learning interferes with previously learned knowledge. Existing data fine-tuning and regularization methods necessitate task identity information during inference and cannot eliminate interference among different tasks, while soft parameter sharing approaches encounter the problem of an increasing model parameter size. To tackle these challenges, we propose the Remembering Transformer, inspired by the brain's Complementary Learning Systems (CLS). Remembering Transformer employs a mixture-of-adapters architecture and a generative model-based novelty detection mechanism in a pretrained Transformer to alleviate CF. Remembering Transformer dynamically routes task data to the most relevant adapter with enhanced parameter efficiency based on knowledge distillation. We conducted…
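The routing idea in the abstract can be sketched as follows: keep one small generative model per known task, send incoming data to the adapter whose model reconstructs it best, and treat data that no model reconstructs well as a new task. This is a hedged illustration only. The PCA-based `LinearAutoencoder` is a stand-in for whatever generative model the authors use, and `NoveltyRouter`, `make_task`, the fixed `threshold`, and all dimensions are hypothetical choices made for this example.

```python
import numpy as np


class LinearAutoencoder:
    """Stand-in generative model: a PCA subspace fitted to one task's data."""

    def __init__(self, data, n_components):
        self.mean = data.mean(axis=0)
        _, _, vt = np.linalg.svd(data - self.mean, full_matrices=False)
        self.components = vt[:n_components]  # shape (n_components, dim)

    def reconstruction_error(self, x):
        centered = x - self.mean
        recon = centered @ self.components.T @ self.components
        return np.mean((centered - recon) ** 2, axis=-1)  # one value per row


class NoveltyRouter:
    """Routes a batch to the best-matching adapter or flags it as novel."""

    def __init__(self, n_components, threshold):
        self.n_components = n_components
        self.threshold = threshold  # assumed fixed; a real system would tune it
        self.detectors = []  # detector i corresponds to adapter i

    def route(self, batch):
        """Return (adapter_id, is_novel) for a batch of feature vectors."""
        if not self.detectors:
            return None, True
        errors = [d.reconstruction_error(batch).mean() for d in self.detectors]
        best = int(np.argmin(errors))
        return best, bool(errors[best] > self.threshold)

    def observe(self, batch):
        """Route a batch; allocate a new detector/adapter slot if it is novel."""
        adapter_id, is_novel = self.route(batch)
        if is_novel:
            self.detectors.append(LinearAutoencoder(batch, self.n_components))
            adapter_id = len(self.detectors) - 1
        return adapter_id


def make_task(rng, dim=32, latent=2, noise=0.01):
    """Hypothetical synthetic task: points near a random low-dim subspace."""
    basis = rng.normal(size=(latent, dim))

    def sample(n):
        z = rng.normal(size=(n, latent))
        return z @ basis + noise * rng.normal(size=(n, dim))

    return sample
```

In use, calling `observe` on batches from two synthetic tasks built with `make_task` should allocate two adapter slots, and later batches from either task should be routed back to the matching slot without task identity being supplied at inference time.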
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Attention Is All You Need · Adapter · Linear Layer · Layer Normalization · Dense Connections · Label Smoothing · Residual Connection · Multi-Head Attention · Adam · Dropout
