DirMoE: Dirichlet-routed Mixture of Experts
Amirhossein Vahidi, Hesam Asadollahzadeh, Navid Akhavan Attar, Marie Moullet, Kevin Ly, Xingyi Yang, Mohammad Lotfollahi

TL;DR
DirMoE is a fully differentiable, Dirichlet-based router for Mixture-of-Experts models. It separates which experts are selected from how much each selected expert contributes, which improves performance and expert specialization.
Contribution
We propose DirMoE, a novel end-to-end differentiable routing method that disentangles expert selection and contribution, improving scalability and expert specialization in MoE models.
Findings
DirMoE matches or exceeds existing routing methods in performance.
It provides explicit control over the number of active experts.
The model enhances expert specialization and scalability.
Abstract
Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the…
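The two-stage routing described in the abstract can be illustrated with a minimal forward-sampling sketch. This is not the authors' implementation: the function names, the way the relaxed mask is turned into Dirichlet concentrations, and the roles given to `tau_z` and `alpha_lo` (symbols that appear in the reviews without a definition here) are assumptions made for illustration. The sketch only draws samples; the gradient path via implicit reparameterization would need an autodiff framework and is not shown.

```python
import math
import random


def stable_sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gumbel_sigmoid(logit, tau_z, rng):
    """Relaxed Bernoulli sample: sigmoid((logit + Logistic(0, 1) noise) / tau_z)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    # The difference of two independent Gumbel(0, 1) variables is Logistic(0, 1).
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return stable_sigmoid((logit + logistic_noise) / tau_z)


def sample_dirichlet(alphas, rng):
    """Dirichlet sample obtained by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def route(select_logits, concentrations, tau_z=0.5, alpha_lo=0.05, seed=0):
    """Return (soft selection mask, contribution weights) for one token.

    Assumption: an expert whose mask is near 0 falls back to the small
    concentration alpha_lo, so it receives almost no Dirichlet mass.
    """
    rng = random.Random(seed)
    mask = [gumbel_sigmoid(l, tau_z, rng) for l in select_logits]
    alphas = [m * a + (1.0 - m) * alpha_lo for m, a in zip(mask, concentrations)]
    weights = sample_dirichlet(alphas, rng)
    return mask, weights


# Hypothetical inputs: four experts, the first two strongly preferred.
mask, weights = route([6.0, 6.0, -6.0, -6.0], [5.0, 5.0, 5.0, 5.0])
```

Here the mask answers "which experts" and the Dirichlet weights answer "how much each", which is the separation the abstract argues Top-$k$+Softmax conflates. Lowering `tau_z` pushes the mask toward binary values.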
Peer Reviews
Decision: ICLR 2026 Poster
- **Interpretable sparsity control.** The Beta/Dirichlet calibration + Simpson-index theory offers a principled knob for expected mass on the active set and dispersion, beyond temperature heuristics.
- **Light systems delta.** Reported iteration time/throughput match Switch under compute parity; the router adds negligible overhead.
- **Empirical improvements.** On 7 zero-shot tasks, DirMoE slightly outperforms Switch/ReMoE/SparseMixer on average, and shows stronger expert specialization visuali
- **Load-balancing risk.** The router relies on near-binary masks and dispersion calibration without an explicit balancing mechanism; the paper itself notes potential utilization skew.
- **Limited scale & evaluation breadth.** All results are on a ~185M-param LLaMA with ~30B tokens; there is no evidence at larger scale/more experts or on standard reasoning/coding suites (e.g., MMLU, HumanEval/HellaSwag). External validity for modern LLMs remains unclear.
- **Questionable reuse of prior results.*
1. **Novelty:** The core idea of disentangling expert selection (Bernoulli/Gumbel) from expert contribution (Dirichlet) is well-motivated and novel. Framing this as a probabilistic spike-and-slab model within a VAE framework is a novel approach to MoE routing.
2. **Controllable Sparsity:** I really like the theoretical analysis introduced in Section 5, which connects the Dirichlet concentration parameter $\lambda$ to the expected contribution sparsity. It provides a principled calibrated kn
1. **Clarity of the Training Objective:** The paper's primary weakness is its lack of clarity regarding the overall training objective. The VAE objective (Eq. 8) is presented, but it's not explicitly stated how this loss is combined with the main LM loss. This is a critical detail for reproducibility and understanding.
2. **Justification of the VAE Objective:** The VAE's reconstruction task, reconstructing the token embedding $x$ from the routing vector $r(x)$, is non-obvious. The authors cou
**Originality**: The idea of factorizing routing using a spike-and-slab prior implemented via Gumbel-Sigmoid (selection) and Dirichlet (contribution) is highly original.
**Quality**: The use of Gumbel-sigmoid and implicit reparameterization ensures the entire forward pass remains fully differentiable, avoiding the gradient bottlenecks of standard Top-k routing. This white-box design provides superior interpretability by explicitly controlling how many experts are active ($k$) and how concentra
**Architectural and Optimization Complexity**: DirMoE relies on a complex stack of techniques: Gumbel-Sigmoid relaxation, implicit reparameterization for Dirichlet samples, a full VAE objective, and multiple scheduled hyperparameters ($\tau_z$, $\alpha_{lo}$, $\lambda^{(p,t)}$). While mathematically elegant, this complexity may lead to significant tuning overhead compared to simpler, fully continuous approaches like ReMoE, which should be explicitly discussed and benchmarked for tuning difficult
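The reviews refer to a Simpson-index analysis that ties the Dirichlet concentration to how concentrated the expert contributions are. The paper's exact parameterization is not reproduced on this page, so the sketch below uses only the textbook identity for a Dirichlet with parameters $\alpha_1, \dots, \alpha_k$ and $\alpha_0 = \sum_i \alpha_i$: $\mathbb{E}\left[\sum_i p_i^2\right] = \sum_i \alpha_i(\alpha_i + 1) / (\alpha_0(\alpha_0 + 1))$. The function names are hypothetical, and mapping this onto the paper's $\lambda$ is an assumption.

```python
import random


def expected_simpson(alphas):
    """Closed-form expected Simpson index E[sum_i p_i^2] under Dirichlet(alphas)."""
    a0 = sum(alphas)
    return sum(a * (a + 1.0) for a in alphas) / (a0 * (a0 + 1.0))


def monte_carlo_simpson(alphas, n_samples=20000, seed=0):
    """Monte Carlo estimate of the same quantity, used as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Normalized Gamma(alpha_i, 1) draws give one Dirichlet sample.
        draws = [rng.gammavariate(a, 1.0) for a in alphas]
        s = sum(draws)
        total += sum((d / s) ** 2 for d in draws)
    return total / n_samples


# Symmetric case with k = 4 experts: the identity reduces to (alpha + 1) / (k * alpha + 1).
flat = expected_simpson([1.0] * 4)       # (1 + 1) / (4 + 1) = 0.4
peaked = expected_simpson([100.0] * 4)   # 101 / 401, close to the uniform limit 1 / k
```

In the symmetric case, a large concentration drives the index toward $1/k$ (contributions spread evenly over the $k$ experts), while a small concentration drives it toward 1 (one expert dominates). That monotone relationship is what lets a concentration parameter act as a calibrated dispersion knob.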
Taxonomy
Topics: Complex Network Analysis Techniques · Advanced Graph Neural Networks · Topic Modeling
