Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat

TL;DR
This paper introduces MoE-MLA-RoPE, an architecture combining Mixture of Experts, Multi-head Latent Attention, and Rotary Position Embeddings. It reports a 68% reduction in KV cache memory and a 3.2x inference speedup without sacrificing performance.
Contribution
The paper presents a new architecture that unifies MoE, MLA, and RoPE, with innovations in expert routing, shared expert isolation, and load balancing, enabling more efficient language models.
Findings
68% KV cache memory reduction
3.2x inference speedup
6.9% improvement in validation loss
Abstract
We present MoE-MLA-RoPE, a novel architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient language modeling. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-k selection, enabling flexible specialization through 3.6 × 10^7 possible expert combinations; (2) shared expert isolation that dedicates 2 always-active experts to common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MoE-MLA-RoPE with compression ratio r=d/2…
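The routing scheme described in innovations (1) and (2) can be illustrated with a minimal sketch. Only the counts (64 experts in total, 2 shared, 6 of 62 routed) come from the abstract. The function name `route_token`, the expert id layout, the softmax over the selected logits, and the fixed weight of 1.0 for shared experts are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

NUM_EXPERTS = 64    # total micro-experts (from the abstract)
NUM_SHARED = 2      # always-active shared experts (from the abstract)
TOP_K_ROUTED = 6    # specialized experts selected per token (from the abstract)
NUM_ROUTED = NUM_EXPERTS - NUM_SHARED  # 62 specialized experts


def route_token(router_logits):
    """Pick the experts for one token (hypothetical helper).

    router_logits holds one score per specialized expert (62 values).
    Assumed id layout: shared experts are 0..1, specialized are 2..63.
    """
    if len(router_logits) != NUM_ROUTED:
        raise ValueError(f"expected {NUM_ROUTED} logits, got {len(router_logits)}")

    # Indices of the 6 highest-scoring specialized experts.
    top = sorted(range(NUM_ROUTED), key=lambda i: router_logits[i], reverse=True)
    top = top[:TOP_K_ROUTED]

    # Assumption: gate weights are a softmax over the selected logits only.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]

    # Shared experts bypass the router; weight 1.0 is an assumption.
    expert_ids = list(range(NUM_SHARED)) + [NUM_SHARED + i for i in top]
    weights = [1.0] * NUM_SHARED + gates
    return expert_ids, weights


# Example with random router scores for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
expert_ids, weights = route_token(logits)
```

Each token therefore activates 8 experts: the 2 shared ones regardless of the router output, plus the 6 specialized experts with the highest router scores.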
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
