Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers
Albus Yizhuo Li, Matthew Wicker

TL;DR
This paper introduces VMoER, a scalable Bayesian approach for uncertainty quantification in Mixture-of-Experts Transformers, significantly improving calibration and robustness with minimal computational overhead.
Contribution
The paper presents VMoER, a structured Bayesian routing method for MoE layers that improves uncertainty calibration and routing stability in large-scale models.
Findings
Improves routing stability under noise by 38%
Reduces calibration error by 94%
Increases out-of-distribution AUROC by 12%
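As a rough illustration of what a calibration-error figure measures, the sketch below computes expected calibration error (ECE) with equal-width confidence bins. This is an assumption for illustration only: the summary does not state which calibration metric or binning scheme the paper uses, and the function name and `n_bins` parameter are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the bin-size-weighted mean of |accuracy - confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction was right.
    NOTE: illustrative sketch; not necessarily the metric used in the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group predictions by confidence bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 90% confidence but is right only 75% of the time in that bin contributes a gap of 0.15; a relative reduction such as the one listed above would be computed by comparing this quantity between a baseline and the proposed method.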
Abstract
Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring responsible deployment. While Bayesian methods offer a principled approach to uncertainty quantification, their computational overhead renders their use impractical for training or inference at foundation model scale. State-of-the-art models achieve parameter counts in the trillions through carefully engineered sparsity, including Mixture-of-Experts (MoE) layers. In this work, we demonstrate calibrated uncertainty at scale by introducing Variational Mixture-of-Experts Routing (VMoER), a structured Bayesian approach for modelling uncertainty in MoE layers. VMoER confines Bayesian inference to the expert-selection stage, which is typically performed by a deterministic routing network. We instantiate VMoER using two inference strategies: amortised variational…
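The idea of confining uncertainty to the expert-selection stage can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes a diagonal Gaussian over the routing logits and top-k selection, and the names `sample_routing`, `mean_logits`, and `log_std` are hypothetical. The expert networks themselves would stay deterministic; only the router is sampled.

```python
import math
import random


def sample_routing(mean_logits, log_std, top_k, n_samples, rng):
    """Estimate how often each expert is selected under a stochastic router.

    mean_logits: per-expert mean routing logit (what a deterministic router emits).
    log_std:     per-expert log standard deviation of the logit (ASSUMED Gaussian).
    Returns the fraction of samples in which each expert was among the top-k.
    """
    n_experts = len(mean_logits)
    counts = [0] * n_experts
    for _ in range(n_samples):
        # Reparameterised draw: logit = mean + std * standard normal noise.
        noisy = [m + math.exp(s) * rng.gauss(0.0, 1.0)
                 for m, s in zip(mean_logits, log_std)]
        # Top-k expert selection on the sampled logits.
        chosen = sorted(range(n_experts), key=noisy.__getitem__, reverse=True)[:top_k]
        for expert in chosen:
            counts[expert] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    freqs = sample_routing([0.0, 2.0, 1.0, -1.0], [0.0] * 4,
                           top_k=2, n_samples=1000, rng=rng)
    print(freqs)  # selection frequencies spread across experts signal routing uncertainty
```

When the standard deviations are tiny, the sampled router collapses to the deterministic top-k choice; as they grow, the selection frequencies spread out, which is one simple way to read off routing uncertainty without making the experts themselves Bayesian.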
Taxonomy
Topics: Opportunistic and Delay-Tolerant Networks · Software-Defined Networks and 5G · Adversarial Robustness in Machine Learning
