Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization
James Oldfield, Markos Georgopoulos, Grigorios G. Chrysos, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Jiankang Deng, Ioannis Patras

TL;DR
The paper introduces the $\mu$MoE layer, a scalable, factorized expert layer for vision models that enables fine-grained specialization without the high inference costs of dense MoEs or the training issues caused by the discrete routing of sparse MoEs.
Contribution
The paper proposes the $\mu$MoE layer, a factorized approach to scalable expert specialization in vision models that addresses the computational and training challenges of existing MoE methods.
Findings
Scaling $\mu$MoE layers improves class-level expert specialization.
Pre-training with $\mu$MoE layers maintains accuracy while enhancing expert specialization.
$\mu$MoE layers enable manual bias correction in vision tasks.
Abstract
The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models. $\mu$MoE layers enable scalable expert specialization by performing an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, $\mu$MoEs (1) avoid the restrictively high inference-time costs of dense MoEs, yet (2) do not inherit the training issues of the popular sparse MoEs' discrete (non-differentiable) expert routing. We present both qualitative and quantitative evidence that scaling…
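The idea of computing with a large expert weight tensor "entirely in factorized form" can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a CP (rank-R) factorization of a weight tensor of shape (experts, inputs, outputs), a softmax gate, and hypothetical names (`gate`, `u_expert`, `u_in`, `u_out`) chosen only for illustration. It checks that the factorized forward pass matches the dense one without ever building the full tensor at inference time.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dense_moe_forward(x, gate, weights):
    """Soft MoE using an explicit (experts, d_in, d_out) weight tensor."""
    a = softmax(gate @ x)                      # expert coefficients
    return np.einsum("n,nio,i->o", a, weights, x)

def cp_moe_forward(x, gate, u_expert, u_in, u_out):
    """Same output, computed from CP factors only (assumed factorization)."""
    a = softmax(gate @ x)                      # expert coefficients
    # Contract each mode with its factor matrix, combine per rank component.
    return u_out @ ((u_expert.T @ a) * (u_in.T @ x))

rng = np.random.default_rng(0)
n_experts, d_in, d_out, rank = 8, 5, 3, 4     # illustrative sizes
gate = rng.normal(size=(n_experts, d_in))
u_expert = rng.normal(size=(n_experts, rank))
u_in = rng.normal(size=(d_in, rank))
u_out = rng.normal(size=(d_out, rank))
x = rng.normal(size=d_in)

# Materialize the full tensor only to verify the factorized computation.
full_weights = np.einsum("nr,ir,or->nio", u_expert, u_in, u_out)
y_dense = dense_moe_forward(x, gate, full_weights)
y_factored = cp_moe_forward(x, gate, u_expert, u_in, u_out)
print(np.allclose(y_dense, y_factored))
```

In this sketch the factorized path stores (experts + d_in + d_out) × rank parameters instead of experts × d_in × d_out, which is why the expert count can grow without the dense tensor ever being formed. The gate stays a differentiable softmax, so no discrete routing is involved.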
Taxonomy
Topics: Expert finding and Q&A systems
Methods: Layer Normalization · Average Pooling · Global Average Pooling · Residual Connection · Dropout · Dense Connections · MLP-Mixer · Linear Layer
