Mixture of Modular Experts: Distilling Knowledge from a Multilingual Teacher into Specialized Modular Language Models

Mohammed Al-Maamari; Mehdi Ben Amor; Michael Granitzer

arXiv:2407.19610 · cs.AI · July 30, 2024

TL;DR

This paper introduces a modular multilingual language model framework combining Knowledge Distillation and Mixture of Experts, demonstrating effective language classification, knowledge retention, and efficiency improvements with open-source resources.

Contribution

It presents a novel integration of KD and MoE for multilingual models, compares different architectures, and addresses catastrophic forgetting with practical solutions.

Findings

01. Adaptive alpha in KD offers marginal improvements over fixed alpha.

02. The router classifier achieved 99.95% accuracy in language classification.

03. MoE with common expert mitigates catastrophic forgetting effectively.
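
The alpha in the first finding is the weight that balances the two terms of a distillation loss. The following is a minimal sketch under a common formulation (hard-label cross-entropy plus a temperature-scaled KL term against the teacher), not the paper's exact loss. The function name `kd_loss`, the `temperature` parameter, and the convention that alpha multiplies the hard-label term are all assumptions; this page does not specify the rule used to adapt alpha, so only the weighting itself is shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha, temperature=2.0):
    """Weighted sum of a hard-label term and a soft (teacher) term.

    Assumed convention: alpha weights the cross-entropy with the true label,
    (1 - alpha) weights the KL divergence from teacher to student.
    """
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])

    # KL(teacher || student) on temperature-softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)

    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# Hypothetical logits for a 3-class toy problem.
student = [1.0, 2.0, 0.5]
teacher = [0.5, 3.0, 0.2]
fixed = kd_loss(student, teacher, label=1, alpha=0.5)
```

With a fixed alpha the weight stays constant for the whole run; an adaptive variant would recompute `alpha` during training and pass the new value into the same function.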

Abstract

This research combines Knowledge Distillation (KD) and Mixture of Experts (MoE) to develop modular, efficient multilingual language models. Key objectives include evaluating adaptive versus fixed alpha methods in KD and comparing modular MoE architectures for handling multi-domain inputs and preventing catastrophic forgetting. KD compresses large language models (LLMs) into smaller, efficient models, while MoE enhances modularity with specialized tasks. Experiments showed similar performance for both KD methods, with marginal improvements from adaptive alpha. A combined loss approach provided more stable learning. The router, trained to classify input sequences into English, French, German, or Python, achieved 99.95% precision, recall, and F1 score, with Logistic Regression being the most effective classifier. Evaluations of modular MoE architectures revealed that Pre-trained Language…
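
The router described in the abstract can be illustrated with a small multinomial logistic regression that assigns an input to one of the four classes. This is a sketch under stated assumptions, not the authors' implementation: the bag-of-tokens features, the toy training sentences, the hyperparameters, and the names `featurize`, `train`, and `route` are invented for illustration, and the abstract does not say which features the actual router uses.

```python
import math
from collections import Counter

LABELS = ["English", "French", "German", "Python"]

# Toy training data, invented for illustration only.
TRAIN = [
    ("the cat sat on the mat", "English"),
    ("this is a short sentence", "English"),
    ("le chat est sur le tapis", "French"),
    ("ceci est une phrase courte", "French"),
    ("die katze sitzt auf der matte", "German"),
    ("das ist ein kurzer satz", "German"),
    ("def add(a, b): return a + b", "Python"),
    ("for i in range(10): print(i)", "Python"),
]


def featurize(text, vocab):
    """Whitespace-token counts over a fixed vocabulary (assumed features)."""
    counts = Counter(text.split())
    return [float(counts.get(tok, 0)) for tok in vocab]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scores_for(x, weights):
    """One linear score per class."""
    return [sum(w * v for w, v in zip(wc, x)) for wc in weights]


def train(data, epochs=300, lr=0.1):
    """Full-batch gradient descent on the softmax cross-entropy loss."""
    vocab = sorted({tok for text, _ in data for tok in text.split()})
    weights = [[0.0] * len(vocab) for _ in LABELS]
    rows = [(featurize(t, vocab), LABELS.index(y)) for t, y in data]
    for _ in range(epochs):
        grads = [[0.0] * len(vocab) for _ in LABELS]
        for x, y in rows:
            probs = softmax(scores_for(x, weights))
            for c in range(len(LABELS)):
                err = probs[c] - (1.0 if c == y else 0.0)
                for j, v in enumerate(x):
                    if v:
                        grads[c][j] += err * v
        for c in range(len(LABELS)):
            for j in range(len(vocab)):
                weights[c][j] -= lr * grads[c][j] / len(rows)
    return vocab, weights


def route(text, vocab, weights):
    """Return the label of the highest-scoring class."""
    scores = scores_for(featurize(text, vocab), weights)
    return LABELS[scores.index(max(scores))]


vocab, weights = train(TRAIN)
```

In a modular setup, the label returned by `route` would select which specialized expert processes the sequence. Note that this toy version has no bias term and falls back to the first label when no token is in the vocabulary.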

Code & Models

Repositories

padas-lab-de/multi-language-dataset-creator · tf · Official

Taxonomy

Topics: Second Language Learning and Teaching · Innovative Teaching and Learning Methods

Methods: Logistic Regression · Mixture of Experts · Knowledge Distillation