Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
Klaus-Rudolf Kladny, Maximilian Mordig, Bernhard Schölkopf, Michael Muehlebach

TL;DR
This paper introduces AIR-MoE, a two-stage routing method for granular mixture-of-experts models that improves efficiency and performance without altering model architecture.
Contribution
AIR-MoE is a novel inverted-index-inspired routing architecture that approximates top-k expert selection efficiently in granular MoE models.
Findings
AIR-MoE reduces routing cost in granular MoE settings by computing exact routing scores only for a shortlist of candidate experts rather than for every expert.
AIR-MoE achieves better performance than existing routing methods.
The method requires no changes to model architecture or loss function.
Abstract
Mixture-of-experts (MoE) models enable scalable transformer architectures by activating only a subset of experts per token. Recent evidence suggests that performance improves with increasingly granular experts, i.e., many small experts instead of a few large ones. However, this regime substantially increases routing cost, which can dominate computation. We introduce adaptive inverted-index routing for MoE (AIR-MoE), an inverted-index-inspired routing architecture based on vector quantization (VQ). In a first stage, AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts. In a second stage, fine scoring computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and, in contrast to prior work, imposing no structural constraints on expert…
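
The two-stage procedure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes dot-product routing scores, nearest-codeword assignment by Euclidean distance, and an inverted index in which each codeword lists the experts whose keys score highest against it. The function names and the parameters `experts_per_code` and `n_probe` are hypothetical and do not come from the paper.

```python
import numpy as np

def build_inverted_index(codebook, expert_keys, experts_per_code):
    """Hypothetical index construction (an assumption, not from the paper):
    each codeword stores the ids of the experts whose keys have the
    highest dot product with that codeword."""
    scores = codebook @ expert_keys.T                 # shape (C, E)
    order = np.argsort(-scores, axis=1)               # best experts first
    return order[:, :experts_per_code]                # shape (C, m)

def two_stage_route(x, codebook, index, expert_keys, k, n_probe=2):
    """Approximate top-k routing for one token vector x."""
    # Stage 1 (coarse shortlisting): assign the token to its nearest
    # codewords and take the union of their expert lists.
    dists = ((codebook - x) ** 2).sum(axis=1)         # shape (C,)
    probes = np.argsort(dists)[:n_probe]
    shortlist = np.unique(index[probes].ravel())

    # Stage 2 (fine scoring): exact scores, restricted to the shortlist.
    fine = expert_keys[shortlist] @ x
    top = np.argsort(-fine)[:k]
    return shortlist[top], fine[top]

def full_route(x, expert_keys, k):
    """Reference: exact top-k routing that scores every expert."""
    scores = expert_keys @ x
    top = np.argsort(-scores)[:k]
    return top, scores[top]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_experts, num_codes, dim, k = 1024, 32, 16, 4
    expert_keys = rng.normal(size=(num_experts, dim))
    codebook = rng.normal(size=(num_codes, dim))      # stand-in for a trained VQ codebook
    index = build_inverted_index(codebook, expert_keys, experts_per_code=64)

    x = rng.normal(size=dim)
    approx_ids, _ = two_stage_route(x, codebook, index, expert_keys, k)
    exact_ids, _ = full_route(x, expert_keys, k)
    overlap = len(set(approx_ids) & set(exact_ids))
    print(f"experts shared with exact top-{k}: {overlap}")
```

In this sketch the second stage scores at most `n_probe * experts_per_code` experts per token instead of all of them, which is where the saving comes from when the number of experts is large. How closely the result matches exact top-k depends on how the codebook and index are built; the random codebook above is only a placeholder.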
