Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models

Sushant Mehta; Raj Dandekar; Rajat Dandekar; Sreedath Panat

arXiv:2508.01261 · cs.AI · August 5, 2025

TL;DR

This paper introduces MoE-MLA-RoPE, a novel architecture combining Mixture of Experts, Multi-head Latent Attention, and Rotary Position Embeddings, achieving significant efficiency improvements in language models without sacrificing performance.

Contribution

The paper presents a new architecture that unifies MoE, MLA, and RoPE, with innovations in expert routing, shared expert isolation, and load balancing, enabling more efficient language models.

Findings

1. 68% KV cache memory reduction
2. 3.2x inference speedup
3. 6.9% improvement in validation loss

Abstract

We present MoE-MLA-RoPE, an architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient language modeling. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-$k$ selection, enabling flexible specialization through $3.6 \times 10^7$ possible expert combinations; (2) shared expert isolation that dedicates 2 always-active experts to common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MoE-MLA-RoPE with compression ratio r=d/2…
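
The routing scheme described in the abstract (2 shared experts that are always active, plus 6 selected from 62 specialized experts) can be illustrated with a minimal sketch. This is not the authors' implementation. The expert counts come from the abstract; everything else is an assumption made for illustration: the function names `softmax` and `route_token`, the placement of shared experts at ids 0 and 1, the fixed gate of 1.0 for shared experts, and the renormalization of the selected gates. Load balancing is not modeled.

```python
import math
import random

NUM_EXPERTS = 64                        # total micro-experts (from the abstract)
NUM_SHARED = 2                          # always-active shared experts (from the abstract)
TOP_K = 6                               # routed experts chosen per token (from the abstract)
NUM_ROUTED = NUM_EXPERTS - NUM_SHARED   # 62 specialized experts


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K, num_shared=NUM_SHARED):
    """Return (expert_ids, gates) for a single token.

    Assumptions (hypothetical, not taken from the paper):
    - shared experts occupy ids 0..num_shared-1 and receive a gate of 1.0;
    - routed experts receive softmax gates renormalized over the top-k set.
    """
    probs = softmax(router_logits)
    # Indices of the top_k highest-probability specialized experts.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)

    shared_ids = list(range(num_shared))
    routed_ids = [num_shared + i for i in chosen]  # offset past the shared ids
    gates = [1.0] * num_shared + [probs[i] / mass for i in chosen]
    return shared_ids + routed_ids, gates


# Example: random router logits stand in for a learned router's output.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
expert_ids, gates = route_token(logits)
```

Under these assumptions each token activates 8 of the 64 experts: the two shared experts unconditionally, and six specialized experts whose gates sum to 1.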

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques