SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget

Rui Kong; Yuanchun Li; Qingtian Feng; Weijun Wang; Xiaozhou Ye; Ye Ouyang; Linghe Kong; Yunxin Liu

arXiv:2308.15030 · cs.AI · May 30, 2024

TL;DR

SwapMoE is a framework that enables efficient serving of large MoE-based language models on memory-limited devices by maintaining a small set of important experts, reducing memory and latency with minimal accuracy loss.

Contribution

It introduces a dynamic expert selection mechanism that allows MoE models to operate within tunable memory budgets without significant performance degradation.

Findings

01 Reduced memory footprint from 14.2 GiB to 4.7 GiB.

02 Achieved 50% latency reduction.

03 Maintained near-original accuracy with a slight Rouge-2 score drop.

Abstract

Mixture of experts (MoE) is a popular technique for increasing the capacity of Large Language Models (LLMs) with conditionally activated parallel experts. However, serving MoE models on memory-constrained devices is challenging because of their large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. The main idea of SwapMoE is to keep a small, dynamic set of important experts, called Virtual Experts, in main memory for inference, while seamlessly maintaining the mapping from the Virtual Experts to the actual experts. Experiments show that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch…
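
The Virtual Experts idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `VirtualExpertLayer`, the moving-average importance score, the `decay` parameter, and the `refresh` policy are all assumptions made for illustration. The sketch only shows the general pattern of routing among a small resident set of experts while tracking which actual experts deserve a slot.

```python
from typing import List


class VirtualExpertLayer:
    """Toy model of one MoE layer with a fixed number of resident expert slots.

    Hypothetical sketch: names and the importance heuristic are assumptions,
    not taken from the SwapMoE paper.
    """

    def __init__(self, num_experts: int, num_slots: int, decay: float = 0.9):
        if not 0 < num_slots <= num_experts:
            raise ValueError("num_slots must be between 1 and num_experts")
        self.num_experts = num_experts
        self.num_slots = num_slots      # the tunable memory budget, in experts
        self.decay = decay
        self.importance = [0.0] * num_experts
        # Mapping from slots to actual expert ids; start with the first ones.
        self.resident = list(range(num_slots))

    def route(self, router_scores: List[float]) -> int:
        """Pick the best resident expert and update importance estimates."""
        if len(router_scores) != self.num_experts:
            raise ValueError("expected one router score per expert")
        # Track importance for all experts, resident or not (assumed heuristic:
        # exponential moving average of router scores).
        for e, score in enumerate(router_scores):
            self.importance[e] = (
                self.decay * self.importance[e] + (1.0 - self.decay) * score
            )
        # Inference only ever touches experts that are already in memory.
        return max(self.resident, key=lambda e: router_scores[e])

    def refresh(self) -> List[int]:
        """Re-pick the resident set; return expert ids that must be loaded."""
        ranked = sorted(
            range(self.num_experts),
            key=lambda e: self.importance[e],
            reverse=True,
        )
        new_resident = sorted(ranked[: self.num_slots])
        to_load = [e for e in new_resident if e not in self.resident]
        self.resident = new_resident
        return to_load
```

In this sketch, `route` would run on every token while `refresh` would run occasionally in the background, so that swapping stays off the inference critical path. Shrinking or growing `num_slots` is the stand-in for tuning the memory budget.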

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Expert finding and Q&A systems

Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax