Mixtures of SubExperts for Large Language Continual Learning

Haeyong Kang

arXiv:2511.06237 · cs.LG · November 11, 2025

TL;DR

This paper introduces Mixtures of SubExperts (MoSEs), an adaptive parameter-efficient fine-tuning method for large language models that minimizes forgetting and scales efficiently in continual learning scenarios.

Contribution

MoSEs integrate sparse SubExperts with a task-specific routing mechanism, enabling knowledge transfer and minimal forgetting while maintaining sublinear growth in model capacity.
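
As a rough illustration of this idea (not the paper's implementation), the sketch below shows one way sparse SubExperts and a task-specific router could draw on a shared pool of expert weights. Everything in it is an assumption made for illustration: the names `make_task_state`, `subexpert_masks`, `route`, and `mose_layer`, the score-based top-fraction masks, the top-k softmax gating, and all sizes. The summary above does not specify any of these details.

```python
import numpy as np

D_MODEL = 8      # hidden size (illustrative)
N_EXPERTS = 4    # size of the shared expert pool (illustrative)
SPARSITY = 0.25  # fraction of each expert's weights kept in a SubExpert
TOP_K = 2        # experts activated per task

# Shared pool of dense expert weights, reused by every task.
experts = np.random.default_rng(0).normal(
    scale=0.1, size=(N_EXPERTS, D_MODEL, D_MODEL)
)


def make_task_state(rng):
    """Hypothetical per-task state: router logits and per-weight scores."""
    return {
        "router_logits": rng.normal(size=N_EXPERTS),
        "scores": rng.normal(size=(N_EXPERTS, D_MODEL, D_MODEL)),
    }


def subexpert_masks(scores, keep_fraction):
    """Binary masks keeping the highest-scoring weights of each expert."""
    n_keep = int(round(keep_fraction * scores[0].size))
    flat = scores.reshape(scores.shape[0], -1)
    # Per-expert threshold: the n_keep-th largest score.
    thresholds = np.sort(flat, axis=1)[:, -n_keep][:, None]
    return (flat >= thresholds).reshape(scores.shape)


def route(router_logits, top_k):
    """Softmax gates restricted to the top_k experts for one task."""
    chosen = np.argsort(router_logits)[-top_k:]
    gates = np.zeros_like(router_logits)
    exp = np.exp(router_logits[chosen] - router_logits[chosen].max())
    gates[chosen] = exp / exp.sum()
    return gates


def mose_layer(x, task_state):
    """Gated sum of masked (sparse) experts selected for one task."""
    masks = subexpert_masks(task_state["scores"], SPARSITY)
    gates = route(task_state["router_logits"], TOP_K)
    out = np.zeros_like(x)
    for k in np.flatnonzero(gates):
        out += gates[k] * (x @ (experts[k] * masks[k]))
    return out


# Two tasks with different routing and masks over the same expert pool.
task_a = make_task_state(np.random.default_rng(1))
task_b = make_task_state(np.random.default_rng(2))
x = np.ones((3, D_MODEL))
y_a = mose_layer(x, task_a)
y_b = mose_layer(x, task_b)
```

In this sketch the dense expert weights are shared across tasks, and each task contributes only its own router logits and masks. How the actual method learns its masks and routing is not covered by the summary on this page.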

Findings

01. MoSEs outperform traditional methods in knowledge retention.

02. MoSEs achieve state-of-the-art results on TRACE benchmarks.

03. MoSEs require less memory and computation.

Abstract

Adapting Large Language Models (LLMs) to a continuous stream of tasks is a critical yet challenging endeavor. While Parameter-Efficient Fine-Tuning (PEFT) methods have become a standard for this, they face a fundamental dilemma in continual learning. Reusing a single set of PEFT parameters for new tasks often leads to catastrophic forgetting of prior knowledge. Conversely, allocating distinct parameters for each task prevents forgetting but results in linear growth of the model's size and fails to facilitate knowledge transfer between related tasks. To overcome these limitations, we propose Mixtures of SubExperts (MoSEs), a novel adaptive PEFT method and continual learning framework designed for minimal forgetting and efficient scalability. MoSEs integrate a sparse Mixture of SubExperts into the transformer layers, governed by a task-specific routing…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications