Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design

Ruisi Cai; Yeonju Ro; Geon-Woo Kim; Peihao Wang; Babak Ehteshami Bejnordi; Aditya Akella; Zhangyang Wang

arXiv:2410.19123·cs.CL·October 28, 2024

TL;DR

Read-ME is a system-aware framework that converts pre-trained dense LLMs into smaller Mixture-of-Experts models, improving inference speed and accuracy while avoiding the cost of training an MoE from scratch.

Contribution

It proposes a novel pre-gating router design that decouples expert composition from the backbone, optimizing system performance and reducing inference costs.
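To make the idea of a router that is decoupled from the backbone concrete, here is a minimal sketch in plain Python. It assumes a single linear router with top-1 selection that runs once per token before any backbone layer; the function names `pre_gate` and `group_by_expert`, the scoring rule, and the toy inputs are all hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def pre_gate(token_features, router_weights):
    """Choose one expert per token before the backbone runs.

    token_features: list of feature vectors, one per token.
    router_weights: list of weight vectors, one per expert (assumed linear router).
    Returns a list with the chosen expert index for each token.
    """
    choices = []
    for x in token_features:
        # Dot-product score of this token against every expert's router row.
        scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
        # Top-1 selection; ties resolve to the lowest expert index.
        choices.append(max(range(len(scores)), key=scores.__getitem__))
    return choices


def group_by_expert(choices):
    """Group token indices by their chosen expert so each group can be batched."""
    groups = defaultdict(list)
    for token_idx, expert in enumerate(choices):
        groups[expert].append(token_idx)
    return dict(groups)


# Toy example: three tokens, two experts.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
router = [[1.0, 0.0], [0.0, 1.0]]
choices = pre_gate(tokens, router)
print(choices)                   # [0, 1, 0]
print(group_by_expert(choices))  # {0: [0, 2], 1: [1]}
```

Because the expert choice is known before the backbone executes, a serving system could in principle use the grouping to batch tokens per expert and load only the expert weights it needs, which is the kind of system-level benefit the contribution statement points to.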

Findings

01. Achieves up to 10.1% accuracy improvement on MMLU.

02. Reduces mean end-to-end latency by up to 6.1%.

03. Enables scalable, resource-efficient LLM inference.

Abstract

The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, due to misaligned design choices between the model architecture and the system policies. Furthermore, the conventional approach of training MoEs from scratch is increasingly prohibitive in terms of cost. In this paper, we propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models (in contrast to "upcycling" generalist MoEs), avoiding the high costs of ground-up training. Our approach employs activation sparsity to extract experts. To compose experts, we examine the widely-adopted layer-wise router…
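The abstract states that experts are extracted using activation sparsity, but the excerpt does not give the procedure. As a rough illustration only, the sketch below assumes a simple magnitude-based rule: record feed-forward hidden activations on calibration tokens from one domain, keep the neurons with the largest mean absolute activation, and slice the layer's weights down to those neurons. The names `select_expert_neurons` and `slice_ffn`, the selection rule, and the toy matrices are hypothetical.

```python
def select_expert_neurons(activations, keep):
    """Return sorted indices of the `keep` most active hidden neurons.

    activations: list of rows, one per calibration token; each row holds the
    hidden-neuron activations of one feed-forward layer (assumed input format).
    """
    n_neurons = len(activations[0])
    n_tokens = len(activations)
    # Mean absolute activation per neuron over the calibration tokens.
    mean_abs = [
        sum(abs(row[j]) for row in activations) / n_tokens
        for j in range(n_neurons)
    ]
    ranked = sorted(range(n_neurons), key=lambda j: mean_abs[j], reverse=True)
    return sorted(ranked[:keep])


def slice_ffn(w_in, w_out, neurons):
    """Keep only the selected hidden neurons of a two-matrix feed-forward layer.

    w_in:  one row per hidden neuron (hidden x d_model).
    w_out: one column per hidden neuron (d_model x hidden).
    """
    small_in = [w_in[j] for j in neurons]
    small_out = [[row[j] for j in neurons] for row in w_out]
    return small_in, small_out


# Toy example: 4 hidden neurons, 2 calibration tokens, keep the 2 most active.
acts = [[0.0, 2.0, 0.1, -1.5],
        [0.0, 1.0, 0.0, -2.5]]
neurons = select_expert_neurons(acts, keep=2)
print(neurons)  # [1, 3]

w_in = [[1, 2], [3, 4], [5, 6], [7, 8]]
w_out = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(slice_ffn(w_in, w_out, neurons))  # ([[3, 4], [7, 8]], [[2, 4], [6, 8]])
```

Repeating this with calibration data from different domains would yield several smaller sub-networks carved out of one dense layer, which is one plausible reading of how a dense model can be turned into a smaller MoE rather than an upcycled, larger one.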

Code & Models

Repositories

vita-group/read-me
jax · Official

Taxonomy

Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Data Quality and Management

Methods: Mixture of Experts