Horseshoe Mixtures-of-Experts (HS-MoE)
Nick Polson, Vadim Sokolov

TL;DR
This paper introduces Horseshoe Mixtures-of-Experts (HS-MoE), a Bayesian model that uses the horseshoe prior to achieve data-adaptive sparsity in expert selection, together with a particle learning algorithm for sequential inference. It also relates the model to the sparse mixture-of-experts layers used in large language models.
Contribution
The paper presents a Bayesian framework for sparse mixture-of-experts models, a particle learning algorithm for sequential inference in that framework, and a discussion of how the framework relates to mixture-of-experts layers in large language models.
Findings
Effective sparse expert selection via horseshoe prior
Particle learning algorithm for sequential inference
Relevance to large language models with extreme sparsity
Abstract
Horseshoe mixtures-of-experts (HS-MoE) models provide a Bayesian framework for sparse expert selection in mixture-of-experts architectures. We combine the horseshoe prior's adaptive global-local shrinkage with input-dependent gating, yielding data-adaptive sparsity in expert usage. Our primary methodological contribution is a particle learning algorithm for sequential inference, in which the filter is propagated forward in time while tracking only sufficient statistics. We also discuss how HS-MoE relates to modern mixture-of-experts layers in large language models, which are deployed under extreme sparsity constraints (e.g., activating a small number of experts per token out of a large pool).
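The combination of global-local shrinkage and input-dependent gating described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: placing the horseshoe prior on the gating coefficients, the softmax gate, and the names `sample_horseshoe`, `gate`, and `top_k_gate` are all assumptions made for illustration, and the particle learning algorithm is not shown.

```python
import numpy as np

def sample_horseshoe(rng, n_experts, n_features, tau_scale=1.0):
    """Draw gating coefficients from a horseshoe prior (assumed placement).

    beta[k, j] ~ N(0, (lam[k, j] * tau)^2), with half-Cauchy local scales
    lam and a single half-Cauchy global scale tau.
    """
    tau = np.abs(rng.standard_cauchy()) * tau_scale              # global scale
    lam = np.abs(rng.standard_cauchy(size=(n_experts, n_features)))  # local scales
    beta = rng.normal(size=(n_experts, n_features)) * lam * tau
    return beta, lam, tau

def gate(x, beta):
    """Input-dependent softmax gate: one weight per expert for each row of x."""
    logits = x @ beta.T
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

def top_k_gate(weights, k):
    """Hard top-k routing, as in LLM-style MoE layers (hypothetical helper).

    Keeps the k largest gate weights per input and renormalises them.
    """
    idx = np.argpartition(weights, -k, axis=-1)[..., -k:]
    mask = np.zeros_like(weights)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    sparse = weights * mask
    return sparse / sparse.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
beta, lam, tau = sample_horseshoe(rng, n_experts=8, n_features=4)
x = rng.normal(size=(5, 4))          # five inputs (e.g. tokens), four features
weights = gate(x, beta)              # soft, prior-driven sparsity
routed = top_k_gate(weights, k=2)    # hard sparsity: at most 2 active experts
```

The contrast in the last two lines is the point of the sketch: the horseshoe prior shrinks most coefficients toward zero while leaving a few large, so sparsity emerges from the data, whereas top-k routing fixes the number of active experts in advance.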
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Machine Learning and Algorithms · Advanced Bandit Algorithms Research
