Optimizing MoE Routers: Design, Implementation, and Evaluation in Transformer Models

Daniel Fidel Harvey; George Weale; Berk Yilmaz

arXiv:2506.16419·cs.LG·June 23, 2025

TL;DR

This paper compares various router architectures in Mixture of Experts models, analyzing their efficiency, expressiveness, and trade-offs to optimize large-scale transformer performance.

Contribution

It introduces a new MLP-Hadamard router and provides a comprehensive evaluation of six router variants within transformer models.

Findings

01

Linear routers are faster but less expressive.

02

MLP and Attention routers improve accuracy and expressiveness.

03

The MLP-Hadamard router enables structured, sparse routing.
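The speed versus expressiveness trade-off in the findings above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the bias-free linear layer, the one-hidden-layer MLP, and top-k selection with renormalized gates are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_router(token, weights, top_k=2):
    """Score each expert with one dot product, keep top_k, renormalize.

    Assumption: no bias term and softmax-then-top-k gating; the paper
    may differ on both points.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)
    return {e: probs[e] / norm for e in chosen}


def linear_router_params(d_model, n_experts):
    """Parameter count of a bias-free linear router."""
    return d_model * n_experts


def mlp_router_params(d_model, d_hidden, n_experts):
    """Parameter count of a hypothetical one-hidden-layer MLP router with biases."""
    return d_model * d_hidden + d_hidden + d_hidden * n_experts + n_experts


rng = random.Random(0)
d_model, n_experts = 8, 4
weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
token = [rng.gauss(0, 1) for _ in range(d_model)]

gates = linear_router(token, weights, top_k=2)  # {expert_index: gate_weight}
```

Under these assumptions, `linear_router_params(8, 4)` is 32 while `mlp_router_params(8, 16, 4)` is 212. The extra parameters and the second matrix product are where an MLP router would pay in latency for its added capacity.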

Abstract

Mixture of Experts (MoE) architectures increase large language model scalability, yet their performance depends on the router module that assigns tokens to specialized experts. Poor routing can cause load imbalance and reduced accuracy. This project designed and implemented different router architectures within Transformer models to address these limitations. We experimented with six distinct router variants: Linear, Attention, Multi-Layer Perceptron (MLP), Hybrid, Hash, and our new MLP-Hadamard. We characterized these routers using BERT and the Qwen1.5-MoE model, measuring parameter efficiency, inference latency, routing entropy, and expert utilization patterns. Our evaluations showed distinct trade-offs: Linear routers offer speed, while MLP and Attention routers provide greater expressiveness. The MLP-Hadamard router shows a unique capability for structured, sparse routing. We successfully…
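Two of the metrics named in the abstract, routing entropy and expert utilization, can be sketched as follows. The abstract does not give the paper's exact definitions, so this assumes Shannon entropy (in nats) of each token's routing distribution and utilization as the fraction of top-k assignments each expert receives. The function names and the toy batch are hypothetical.

```python
import math
from collections import Counter


def routing_entropy(probs):
    """Shannon entropy (nats) of one token's routing distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_routing_entropy(batch):
    """Average per-token routing entropy over a batch."""
    return sum(routing_entropy(p) for p in batch) / len(batch)


def expert_utilization(batch, top_k=1):
    """Fraction of top_k assignments that go to each expert.

    Ties are broken in favor of the lower expert index, because
    Python's sort is stable even with reverse=True.
    """
    n_experts = len(batch[0])
    counts = Counter()
    for probs in batch:
        top = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        counts.update(top)
    total = len(batch) * top_k
    return [counts[e] / total for e in range(n_experts)]


# Toy routing distributions for three tokens over four experts (made-up values).
batch = [
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: entropy = ln(4)
    [1.0, 0.0, 0.0, 0.0],      # fully confident: entropy = 0
    [0.7, 0.1, 0.1, 0.1],
]

avg_entropy = mean_routing_entropy(batch)
utilization = expert_utilization(batch, top_k=1)
```

In this toy batch every token's top choice is expert 0, so `utilization` is `[1.0, 0.0, 0.0, 0.0]`. That is the kind of load imbalance the abstract attributes to poor routing, even though the average entropy is well above zero.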

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Complex Network Analysis Techniques