Mixture of Experts Meets Prompt-Based Continual Learning
Minh Le, An Nguyen, Huy Nguyen, Trang Nguyen, Trang Pham, Linh Van Ngo, Nhat Ho

TL;DR
This paper reveals that the attention mechanism in pre-trained models functions as a mixture of experts, leading to a novel gating method called NoRGa that improves prompt-based continual learning.
Contribution
It provides a theoretical understanding of prompt effectiveness, introduces a new gating mechanism, and demonstrates improved continual learning performance.
Findings
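Attention blocks in pre-trained models implicitly encode a mixture-of-experts architecture.
NoRGa, the proposed gating mechanism, improves continual learning performance.
The claims are supported by both theoretical analysis and empirical validation across benchmarks.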
Abstract
Exploiting the power of pre-trained models, prompt-based approaches stand out compared to other continual learning solutions in effectively preventing catastrophic forgetting, even with very few learnable parameters and without the need for a memory buffer. While existing prompt-based continual learning methods excel in leveraging prompts for state-of-the-art performance, they often lack a theoretical explanation for the effectiveness of prompting. This paper conducts a theoretical analysis to unravel how prompts bestow such advantages in continual learning, thus offering a new perspective on prompt design. We first show that the attention block of pre-trained models like Vision Transformers inherently encodes a special mixture of experts architecture, characterized by linear experts and quadratic gating score functions. This realization drives us to provide a novel view on prefix…
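The claim that an attention block behaves like a mixture of experts with linear experts and quadratic gating scores can be checked numerically for a single attention head. The sketch below is only an illustration under stated assumptions, not the authors' code or their formal construction: it works per token, uses the usual 1/sqrt(d_k) scaling, and all names (`attention_row`, `moe_row`, `W_q`, `W_k`, `W_v`) are hypothetical. It computes one output row two ways, as standard softmax attention and as a softmax-gated sum of linear experts `W_v^T x_j` with scores `x_i^T (W_q W_k^T) x_j / sqrt(d_k)`, and confirms the two agree.

```python
import math
import random


def matvec_t(W, x):
    """Return W^T x, where W is a list of rows with shape (len(x), d_out)."""
    return [sum(W[k][c] * x[k] for k in range(len(x))) for c in range(len(W[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(X, W_q, W_k, W_v, i):
    """Standard single-head attention output for token i (assumed 1/sqrt(d_k) scaling)."""
    d_k = len(W_q[0])
    q = matvec_t(W_q, X[i])
    keys = [matvec_t(W_k, x) for x in X]
    values = [matvec_t(W_v, x) for x in X]
    weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]


def moe_row(X, W_q, W_k, W_v, i):
    """The same output written as a mixture of experts.

    Expert j is the linear map x_j -> W_v^T x_j; its gating score is the
    quadratic form x_i^T A x_j / sqrt(d_k) with A = W_q W_k^T.
    """
    d, d_k = len(X[0]), len(W_q[0])
    A = [[sum(W_q[r][t] * W_k[c][t] for t in range(d_k)) for c in range(d)] for r in range(d)]
    scores = [
        sum(X[i][r] * A[r][c] * xj[c] for r in range(d) for c in range(d)) / math.sqrt(d_k)
        for xj in X
    ]
    gates = softmax(scores)                    # gating weights, sum to 1
    experts = [matvec_t(W_v, xj) for xj in X]  # linear expert outputs
    return [sum(g * e[c] for g, e in zip(gates, experts)) for c in range(len(experts[0]))]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Small random example: 4 tokens of dimension 3, with d_k = d_v = 2.
rng = random.Random(0)
X = rand_matrix(4, 3, rng)
W_q, W_k, W_v = rand_matrix(3, 2, rng), rand_matrix(3, 2, rng), rand_matrix(3, 2, rng)
max_gap = max(
    abs(a - b)
    for i in range(len(X))
    for a, b in zip(attention_row(X, W_q, W_k, W_v, i), moe_row(X, W_q, W_k, W_v, i))
)
print("max |attention - mixture| =", max_gap)
```

The two functions differ only in how the score is grouped: the attention form projects queries and keys first, while the mixture form makes the quadratic dependence on the inputs explicit through `A = W_q W_k^T`.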
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning
