Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

Nikolas Gritsch, Qizhen Zhang, Acyr Locatelli, Sara Hooker, and Ahmet Üstün

arXiv:2408.15901·cs.CL·August 29, 2024

Open Access

TL;DR

Nexus is an enhanced Mixture of Experts architecture that enables efficient upcycling of dense models into MoEs with adaptive routing, allowing easy addition of new experts and improved specialization for diverse tasks.

Contribution

The paper introduces Nexus, a novel MoE architecture with adaptive routing that facilitates flexible expert addition and better specialization without extensive retraining.

Findings

01

Nexus achieves a relative gain of up to 2.1% in initial upcycling performance.

02

Nexus attains an 18.8% relative gain when the MoE is extended with new experts.

03

Flexible expert addition enhances adaptability and specialization in MoE models.

Abstract

Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on "upcycling" dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the…
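The routing idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `AdaptiveRouter`, the single linear projection, the dot-product scoring, and the top-k selection are all assumptions made for illustration. The sketch only shows why deriving expert embeddings from domain representations lets a new expert be registered without changing the learned projection.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class AdaptiveRouter:
    """Hypothetical router: expert embeddings are projected from domain embeddings."""

    def __init__(self, domain_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Assumption: the learned projection is a single linear map.
        # Random values stand in for trained weights.
        self.projection = [
            [rng.gauss(0.0, 0.1) for _ in range(domain_dim)]
            for _ in range(hidden_dim)
        ]
        self.domain_embeddings = []  # one domain vector per expert

    def add_expert(self, domain_embedding):
        """Register an expert from its domain representation only.

        The projection is left untouched, so no router weights change.
        """
        self.domain_embeddings.append(list(domain_embedding))

    def route(self, token_hidden, top_k=1):
        """Return (expert_index, weight) pairs for the top_k experts."""
        # Expert embeddings are recomputed from the domain embeddings.
        expert_embeddings = [
            matvec(self.projection, d) for d in self.domain_embeddings
        ]
        # Assumption: routing score is a dot product with the token state.
        logits = [
            sum(h * e for h, e in zip(token_hidden, emb))
            for emb in expert_embeddings
        ]
        probs = softmax(logits)
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        return [(i, probs[i]) for i in ranked[:top_k]]


if __name__ == "__main__":
    router = AdaptiveRouter(domain_dim=3, hidden_dim=5)
    router.add_expert([1.0, 0.0, 0.0])  # e.g. a "code" domain (made up)
    router.add_expert([0.0, 1.0, 0.0])  # e.g. a "wiki" domain (made up)
    token = [0.2, -0.1, 0.4, 0.0, 0.3]
    print(router.route(token, top_k=2))
    # A later expert only needs a new domain embedding.
    router.add_expert([0.0, 0.0, 1.0])
    print(router.route(token, top_k=3))
```

In this sketch, adding the third expert changes only the list of domain embeddings; the projection weights are reused as-is, which mirrors the abstract's claim that new experts can be added without large-scale MoE training.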

Taxonomy

Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Expert finding and Q&A systems

Methods: Mixture of Experts · Focus