Optimizing MoE Routers: Design, Implementation, and Evaluation in Transformer Models
Daniel Fidel Harvey, George Weale, Berk Yilmaz

TL;DR
This paper compares various router architectures in Mixture of Experts models, analyzing their efficiency, expressiveness, and trade-offs to optimize large-scale transformer performance.
Contribution
It introduces a new MLP-Hadamard router and provides a comprehensive evaluation of six router variants within transformer models.
Findings
Linear routers are faster but less expressive.
MLP and Attention routers improve accuracy and expressiveness.
The MLP-Hadamard router enables structured, sparse routing.
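The contrast between a Linear and an MLP router can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The dimensions, the ReLU hidden layer, the top-2 selection, and every function name here are assumptions chosen for the example. The MLP-Hadamard router is omitted because its exact form is not given in this summary.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(weights, x):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def linear_router(x, w):
    """Linear router: a single projection from token features to expert logits."""
    return softmax(matvec(w, x))

def mlp_router(x, w1, w2):
    """MLP router (assumed form): one hidden ReLU layer before the expert logits."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return softmax(matvec(w2, hidden))

def top_k(probs, k):
    """Keep the k most probable experts and renormalize their weights to sum to 1."""
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
d_model, d_hidden, n_experts = 8, 16, 4
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

w_linear = random_matrix(n_experts, d_model, rng)
w1 = random_matrix(d_hidden, d_model, rng)
w2 = random_matrix(n_experts, d_hidden, rng)

linear_choice = top_k(linear_router(token, w_linear), k=2)
mlp_choice = top_k(mlp_router(token, w1, w2), k=2)

# Parameter counts (biases omitted) show the cost of the extra layer.
linear_params = n_experts * d_model
mlp_params = d_hidden * d_model + n_experts * d_hidden

print("linear:", linear_choice, "params:", linear_params)
print("mlp:   ", mlp_choice, "params:", mlp_params)
```

In this sketch the Linear router is one matrix multiply per token, while the MLP router adds a hidden layer and therefore more parameters and compute. That is the kind of speed versus expressiveness trade-off the findings describe.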
Abstract
Mixture of Experts (MoE) architectures increase large language model scalability, yet their performance depends on the router module that assigns tokens to specialized experts. Poor routing can cause load imbalance and reduced accuracy. This project designed and implemented different router architectures within Transformer models to address these limitations. We experimented with six distinct router variants: Linear, Attention, Multi-Layer Perceptron (MLP), Hybrid, Hash, and our new MLP-Hadamard. We characterized these routers using BERT and the Qwen1.5-MoE model, examining parameter efficiency, inference latency, routing entropy, and expert utilization patterns. Our evaluations showed distinct trade-offs: Linear routers offer speed, while MLP and Attention routers provide greater expressiveness. The MLP-Hadamard router shows a unique capability for structured, sparse routing. We successfully…
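Two of the metrics named in the abstract, routing entropy and expert utilization, can be illustrated with a small sketch. It assumes that routing entropy means the Shannon entropy of a token's routing distribution and that utilization means the fraction of tokens assigned to each expert. The paper's exact definitions are not given in this summary, and the function names and sample values below are hypothetical.

```python
import math
from collections import Counter

def routing_entropy(probs):
    """Shannon entropy (in nats) of one token's routing distribution.
    High entropy means the router spreads probability across experts;
    low entropy means it commits to a few."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def expert_utilization(assignments, n_experts):
    """Fraction of tokens routed to each expert, given top-1 assignments."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(n_experts)]

# Hypothetical routing distributions for a single token over 4 experts.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

# Hypothetical top-1 expert assignments for a batch of 8 tokens.
assignments = [0, 0, 1, 0, 3, 0, 1, 0]
utilization = expert_utilization(assignments, n_experts=4)

print("entropy (uniform):", routing_entropy(uniform))
print("entropy (peaked): ", routing_entropy(peaked))
print("utilization:      ", utilization)
```

Under these assumptions, a uniform distribution over four experts reaches the maximum entropy of ln 4, and a utilization vector with a zero entry (expert 2 here) shows the kind of load imbalance the abstract attributes to poor routing.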
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Complex Network Analysis Techniques
