MoExtend: Tuning New Experts for Modality and Task Extension
Shanshan Zhong, Shanghua Gao, Zhongzhan Huang, Wushao Wen, Marinka Zitnik, Pan Zhou

TL;DR
MoExtend introduces a framework that efficiently extends large language models with new modalities and tasks by integrating new experts into pre-trained MoE models, avoiding full fine-tuning and catastrophic forgetting.
Contribution
It presents a novel method for modality and task extension in MoE models that requires no tuning of pretrained models, enabling rapid and effective multimodal adaptation.
Findings
MoExtend effectively enhances multimodal capabilities of LLMs.
The approach reduces training costs and mitigates catastrophic forgetting.
Experimental results show improved performance in multimodal tasks.
Abstract
Large language models (LLMs) excel in various tasks but are primarily trained on text data, limiting their application scope. Expanding LLM capabilities to include vision-language understanding is vital, yet training them on multimodal data from scratch is challenging and costly. Existing instruction tuning methods, e.g., LLaVA, often connect a pretrained CLIP vision encoder to an LLM and fully fine-tune the LLM to bridge the modality gap. However, full fine-tuning is plagued by catastrophic forgetting, i.e., forgetting previous knowledge, and by high training costs, particularly in an era of increasing tasks and modalities. To solve this issue, we introduce MoExtend, an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models. MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge…
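The general idea of adding an expert to a pre-trained MoE layer while leaving the existing experts untouched can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyMoELayer`, the method `add_expert`, the top-k softmax routing, and the choice to initialize the new expert as a copy of an existing one are all assumptions made for illustration, and the `trainable` flags only mark which experts an optimizer would be allowed to update.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToyMoELayer:
    """Hypothetical toy MoE layer: linear experts with a top-k softmax router."""

    def __init__(self, dim, num_experts, top_k=2, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.top_k = top_k
        # One gate row and one (dim x dim) weight matrix per expert.
        self.gate = [[rng.gauss(0, 1) for _ in range(dim)]
                     for _ in range(num_experts)]
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]
        self.trainable = [True] * num_experts

    def add_expert(self, source_idx):
        """Freeze all existing experts and append one new trainable expert.

        Assumption: the new expert and its gate row start as copies of
        expert `source_idx`; other initializations are equally possible.
        """
        self.trainable = [False] * len(self.experts)
        self.experts.append([row[:] for row in self.experts[source_idx]])
        self.gate.append(self.gate[source_idx][:])
        self.trainable.append(True)

    def forward(self, x):
        """Route x to the top-k experts and mix their outputs."""
        logits = matvec(self.gate, x)
        top = sorted(range(len(logits)),
                     key=lambda i: logits[i], reverse=True)[:self.top_k]
        weights = softmax([logits[i] for i in top])
        out = [0.0] * self.dim
        for w, i in zip(weights, top):
            y = matvec(self.experts[i], x)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out


layer = ToyMoELayer(dim=4, num_experts=4)
layer.add_expert(source_idx=1)
print(layer.trainable)  # [False, False, False, False, True]
```

In this sketch only the appended expert is marked trainable, which mirrors the stated goal of extending a model without tuning the pre-trained parameters; how the real framework selects layers, initializes experts, or calibrates the router is not specified in the text above.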
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Contrastive Language-Image Pre-training · Mixture of Experts
