Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Zexuan Zhong; Mengzhou Xia; Danqi Chen; Mike Lewis

arXiv:2405.03133·cs.CL·August 20, 2024·6 cites

TL;DR

Lory introduces a fully-differentiable MoE architecture for autoregressive language model pre-training. It combines causal segment routing with similarity-based data batching to improve efficiency, expert specialization, and downstream performance.

Contribution

Lory is the first approach to scale fully-differentiable MoE architectures to autoregressive language model pre-training, with significant gains over dense models in perplexity and downstream performance.

Findings

01. Achieved a 13.9% perplexity improvement over parameter-matched dense models

02. Models with segment-level routing perform competitively with MoEs that use token-level routing

03. Experts capture domain-level specialization without supervision

Abstract

Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We…
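The two ideas named in the abstract, soft merging of experts in parameter space and causal segment routing, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each expert is a single linear map (a plain weight matrix rather than a full feed-forward block), that the router is a linear layer applied to the mean hidden state of the previous segment, and that the first segment, which has no preceding context, uses uniform merging weights. The function names (`merge_experts`, `causal_segment_moe`) and all shapes are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def merge_experts(experts, weights):
    """Soft merging in parameter space: a weighted sum of expert matrices."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(w * e[r][c] for w, e in zip(weights, experts)) for c in range(cols)]
        for r in range(rows)
    ]


def causal_segment_moe(hidden, experts, router, segment_len):
    """Apply one merged expert per segment.

    The merging weights for a segment depend only on the PREVIOUS segment,
    so no token's output depends on tokens that come after it.
    Assumption of this sketch: the first segment uses uniform weights.
    """
    num_experts = len(experts)
    outputs, weights_per_segment = [], []
    prev_segment = None
    for start in range(0, len(hidden), segment_len):
        segment = hidden[start:start + segment_len]
        if prev_segment is None:
            weights = [1.0 / num_experts] * num_experts
        else:
            weights = softmax(matvec(router, mean_vector(prev_segment)))
        merged = merge_experts(experts, weights)  # one merge per segment
        outputs.extend(matvec(merged, h) for h in segment)
        weights_per_segment.append(weights)
        prev_segment = segment
    return outputs, weights_per_segment


# Toy usage: 2 experts, hidden size 2, 4 tokens, segments of 2 tokens.
experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
router = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical router weights
hidden = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
outputs, seg_weights = causal_segment_moe(hidden, experts, router, segment_len=2)
```

Because the merge happens once per segment rather than once per token, the cost of building the merged expert is amortized over `segment_len` tokens, which is the efficiency argument the abstract makes for segment routing. Every step is a differentiable weighted sum, so there is no discrete expert selection to optimize.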

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Mixture of Experts