Mixture of Diverse Size Experts

Manxi Sun; Wei Liu; Jian Luan; Pengzhi Gao; Bin Wang

arXiv:2409.12210 · cs.LG · September 20, 2024

TL;DR

This paper introduces MoDSE (Mixture of Diverse Size Experts), a mixture-of-experts architecture whose experts vary in size. It improves language model predictions by letting each token be routed to an expert of a suitable size.

Contribution

The paper proposes an MoE design whose experts differ in size, together with an expert-pair allocation strategy that balances the uneven workload those differing sizes would otherwise create.
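The summary does not say how expert pairs are formed, so the following is only a minimal sketch of one plausible reading: pair the largest remaining expert with the smallest remaining one, so that every pair (and hence every device hosting a pair) carries a similar total hidden size. The function names, the example sizes, and the use of summed hidden size as a stand-in for workload are all assumptions made for illustration, not the authors' implementation.

```python
def pair_experts_by_size(hidden_sizes):
    """Pair the smallest remaining expert with the largest remaining one.

    Hypothetical helper: returns a list of (small_index, large_index)
    tuples. Assumes an even number of experts.
    """
    if len(hidden_sizes) % 2 != 0:
        raise ValueError("expected an even number of experts")
    # Expert indices ordered from smallest to largest hidden size.
    order = sorted(range(len(hidden_sizes)), key=lambda i: hidden_sizes[i])
    pairs = []
    lo, hi = 0, len(order) - 1
    while lo < hi:
        pairs.append((order[lo], order[hi]))
        lo += 1
        hi -= 1
    return pairs


def load_per_pair(hidden_sizes, pairs):
    """Summed hidden size of each pair, used here as a rough workload proxy."""
    return [hidden_sizes[a] + hidden_sizes[b] for a, b in pairs]


# Hypothetical expert hidden sizes, chosen so that pairs can balance exactly.
sizes = [1024, 3072, 1536, 2560, 2048, 2048, 512, 3584]
pairs = pair_experts_by_size(sizes)
loads = load_per_pair(sizes, pairs)
print(pairs)
print(loads)
```

With these example sizes every pair sums to the same total. Summed hidden size only approximates parameter and compute cost; the real workload also depends on how many tokens the router sends to each expert.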

Findings

01. On difficult token generation tasks, experts of different sizes achieve better predictions.

02. Routing paths tend to stabilize after a period of training.

03. The expert-pair allocation strategy balances the workload across GPUs.

Abstract

The Sparsely-Activated Mixture-of-Experts (MoE) has gained increasing popularity for scaling up large language models (LLMs) without exploding computational costs. Despite its success, the current design faces a challenge where all experts have the same size, limiting the ability of tokens to choose the experts with the most appropriate size for generating the next token. In this paper, we propose the Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with layers designed to have experts of different sizes. Our analysis of difficult token generation tasks shows that experts of various sizes achieve better predictions, and the routing path of the experts tends to be stable after a training period. However, having experts of diverse sizes can lead to uneven workload distribution. To tackle this limitation, we introduce an expert-pair allocation strategy to evenly distribute…
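To make the idea of experts with different sizes concrete, here is a minimal pure-Python sketch of a sparsely activated layer in which each expert is a two-layer feed-forward network with its own hidden width. Everything specific in it is an assumption for illustration: the hidden sizes, the ReLU activation, top-2 routing with renormalized softmax weights, and all function names. None of these details are taken from the abstract.

```python
import math
import random


def make_expert(d_model, d_hidden, rng):
    """Create one feed-forward expert as two plain nested-list matrices."""
    scale = 1.0 / math.sqrt(d_model)
    w_in = [[rng.gauss(0.0, scale) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out


def expert_forward(expert, x):
    """Apply x -> ReLU(x @ w_in) @ w_out; the hidden width varies per expert."""
    w_in, w_out = expert
    d_hidden = len(w_out)
    h = [max(0.0, sum(x[i] * w_in[i][j] for i in range(len(x)))) for j in range(d_hidden)]
    return [sum(h[j] * w_out[j][k] for j in range(d_hidden)) for k in range(len(x))]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    n = len(experts)
    logits = [sum(x[i] * router[i][e] for i in range(len(x))) for e in range(n)]
    probs = softmax(logits)
    chosen = sorted(range(n), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)  # renormalize over chosen experts (assumption)
    out = [0.0] * len(x)
    for e in chosen:
        y = expert_forward(experts[e], x)
        out = [o + (probs[e] / norm) * v for o, v in zip(out, y)]
    return out, chosen


rng = random.Random(0)
d_model = 8
hidden_sizes = [4, 12, 6, 10]  # hypothetical: experts of different sizes
experts = [make_expert(d_model, h, rng) for h in hidden_sizes]
router = [[rng.gauss(0.0, 1.0) for _ in hidden_sizes] for _ in range(d_model)]
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
output, chosen = moe_forward(token, router, experts)
print(len(output), chosen)
```

The only change relative to a uniform MoE layer in this sketch is that `hidden_sizes` is a list rather than a single number; the router and the mixing step are unaffected because every expert maps back to the same model dimension.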

Taxonomy

Topics: Survey Sampling and Estimation Techniques · Bayesian Methods and Mixture Models · SARS-CoV-2 detection and testing

Methods: Mixture of Experts