Mixtures of SubExperts for Large Language Continual Learning
Haeyong Kang

TL;DR
This paper introduces Mixtures of SubExperts (MoSEs), an adaptive parameter-efficient fine-tuning method for large language models that minimizes forgetting and scales efficiently in continual learning scenarios.
Contribution
MoSEs integrate sparse SubExperts with a task-specific routing mechanism, enabling knowledge transfer and minimal forgetting while maintaining sublinear growth in model capacity.
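To make the idea concrete, here is a minimal sketch of how task-specific routing over a shared pool of sparse sub-experts could work. It is not the paper's implementation. The class `SubExpertPool`, the random binary masks, the top-k softmax gating, and all parameter names are assumptions chosen for illustration. The point is only that a new task adds a small router and a mask while reusing the existing expert weights, so the shared capacity does not grow with every task.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class SubExpertPool:
    """Hypothetical pool of shared experts with per-task masks and routers."""

    def __init__(self, dim, n_experts=4, top_k=2, keep=0.5, seed=0):
        self.rng = random.Random(seed)
        self.dim, self.top_k, self.keep = dim, top_k, keep
        # Shared expert weights (dim x dim each); these would be trained in practice.
        self.experts = [
            [[self.rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(dim)]
            for _ in range(n_experts)
        ]
        self.router = {}  # task_id -> one routing logit per expert
        self.masks = {}   # task_id -> binary mask per expert (the "sub-expert")

    def add_task(self, task_id):
        """Register a task: only a router and masks are added, no new experts."""
        n = len(self.experts)
        self.router[task_id] = [self.rng.gauss(0, 1) for _ in range(n)]
        # Assumption: masks are random here; a real method would learn them.
        self.masks[task_id] = [
            [[self.rng.random() < self.keep for _ in range(self.dim)]
             for _ in range(self.dim)]
            for _ in range(n)
        ]

    def forward(self, x, task_id):
        """Add the gated outputs of the task's top-k masked experts to x."""
        logits = self.router[task_id]
        chosen = sorted(range(len(logits)), key=lambda e: logits[e])[-self.top_k:]
        top = max(logits[e] for e in chosen)
        weights = [math.exp(logits[e] - top) for e in chosen]
        total = sum(weights)
        out = list(x)  # residual path standing in for the frozen layer's output
        for w, e in zip(weights, chosen):
            # Zero out the expert weights that this task's mask does not keep.
            sub = [
                [a if kept else 0.0 for a, kept in zip(row, mask_row)]
                for row, mask_row in zip(self.experts[e], self.masks[task_id][e])
            ]
            delta = matvec(sub, x)
            out = [o + (w / total) * d for o, d in zip(out, delta)]
        return out


pool = SubExpertPool(dim=4)
pool.add_task("task_1")
y_before = pool.forward([1.0, 1.0, 1.0, 1.0], "task_1")
pool.add_task("task_2")  # a later task reuses the same expert pool
y_after = pool.forward([1.0, 1.0, 1.0, 1.0], "task_1")
```

In this sketch the output for `task_1` is identical before and after `task_2` is registered because no shared weights are updated. In an actual training setup, protecting the weights selected by earlier tasks would need an extra mechanism that this sketch does not model.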
Findings
MoSEs outperform traditional methods in knowledge retention.
MoSEs achieve state-of-the-art results on TRACE benchmarks.
MoSEs require less memory and computation.
Abstract
Adapting Large Language Models (LLMs) to a continuous stream of tasks is a critical yet challenging endeavor. While Parameter-Efficient Fine-Tuning (PEFT) methods have become a standard for this, they face a fundamental dilemma in continual learning. Reusing a single set of PEFT parameters for new tasks often leads to catastrophic forgetting of prior knowledge. Conversely, allocating distinct parameters for each task prevents forgetting but results in a linear growth of the model's size and fails to facilitate knowledge transfer between related tasks. To overcome these limitations, we propose Mixtures of SubExperts (MoSEs), a novel adaptive PEFT method and continual learning framework designed for minimal forgetting and efficient scalability. MoSEs integrate a sparse Mixture of SubExperts into the transformer layers, governed by a task-specific routing…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
