Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang Wang

TL;DR
Read-ME introduces a system-aware framework to convert pre-trained dense LLMs into efficient Mixture-of-Experts models, improving inference speed and accuracy without costly retraining.
Contribution
It proposes a pre-gating router that is decoupled from the MoE backbone, so expert selection no longer depends on the backbone's intermediate computation. This design is aimed at improving system performance and reducing inference cost.
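To make the idea of a router decoupled from the backbone concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`pre_gate`, `moe_layer`, `forward`), the top-1 routing rule, the `tanh` expert, and the choice to reuse one routing decision across all layers are assumptions made purely for illustration. The point it shows is that every expert choice is known before the first backbone layer runs.

```python
import math
import random

def matvec(vec, mat):
    """Multiply a row vector (length d) by a d x k matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def pre_gate(tokens, router):
    """Hypothetical pre-gating step: pick one expert per token (top-1) up front."""
    ids = []
    for tok in tokens:
        logits = matvec(tok, router)
        ids.append(max(range(len(logits)), key=logits.__getitem__))
    return ids

def moe_layer(hidden, expert_ids, experts):
    """Apply each token's pre-selected expert; the layer itself does no routing."""
    return [[math.tanh(v) for v in matvec(h, experts[e])]
            for h, e in zip(hidden, expert_ids)]

def forward(tokens, router, layers):
    """Route once, then run the backbone with the fixed expert assignment."""
    expert_ids = pre_gate(tokens, router)  # known before the backbone starts
    hidden = tokens
    for experts in layers:
        hidden = moe_layer(hidden, expert_ids, experts)
    return hidden, expert_ids

def rand_matrix(rng, rows, cols):
    """Random Gaussian matrix standing in for trained weights (illustrative only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, dim, n_experts, n_layers = 6, 8, 4, 3
tokens = rand_matrix(rng, n_tokens, dim)
router = rand_matrix(rng, dim, n_experts)
layers = [[rand_matrix(rng, dim, dim) for _ in range(n_experts)]
          for _ in range(n_layers)]
outputs, expert_ids = forward(tokens, router, layers)
```

Because `expert_ids` is available before any layer executes, a serving system could in principle use it to group requests by expert or to load expert weights ahead of time; the sketch does not model those system policies.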
Findings
Achieves up to 10.1% accuracy improvement on MMLU.
Reduces mean end-to-end latency by up to 6.1%.
Enables scalable, resource-efficient LLM inference.
Abstract
The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, due to misaligned design choices between the model architecture and the system policies. Furthermore, the conventional approach of training MoEs from scratch is increasingly prohibitive in terms of cost. In this paper, we propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models (in contrast to "upcycling" generalist MoEs), avoiding the high costs of ground-up training. Our approach employs activation sparsity to extract experts. To compose experts, we examine the widely-adopted layer-wise router…
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Data Quality and Management
Methods: Mixture of Experts
