Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization

James Oldfield; Markos Georgopoulos; Grigorios G. Chrysos; Christos Tzelepis; Yannis Panagakis; Mihalis A. Nicolaou; Jiankang Deng; Ioannis Patras

arXiv:2402.12550·cs.CV·October 18, 2024·1 cites

TL;DR

The paper introduces $\mu$MoE layers, scalable factorized expert layers for vision models that enable fine-grained specialization without the high inference cost of dense MoEs or the training issues of sparse MoEs.

Contribution

It proposes the $\mu$MoE layer, a factorized approach to scalable expert specialization in vision models that addresses the computational and training challenges of existing MoE methods.

Findings

01

Scaling $\mu$MoE improves class-level expert specialization.

02

Pre-training with $\mu$MoE maintains accuracy while enhancing expert specialization.

03

$\mu$MoE layers enable manual bias correction in vision tasks.

Abstract

The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models. $\mu$MoE layers enable scalable expert specialization by performing an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, $\mu$MoEs (1) avoid the restrictively high inference-time costs of dense MoEs, yet (2) do not inherit the training issues of the popular sparse MoEs' discrete (non-differentiable) expert routing. We present both qualitative and quantitative evidence that scaling…
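To make the phrase "implicit computation on prohibitively large weight tensors entirely in factorized form" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a rank-R CP-style factorization of an expert weight tensor of shape (experts, inputs, outputs) and a softmax gate. The factor names `A`, `B`, `C`, the gating matrix `G`, and all sizes are hypothetical choices made for illustration. The sketch checks that the factorized forward pass matches a dense soft MoE that materializes the full tensor.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dense_moe(x, a, W):
    # Dense soft MoE: y_o = sum_n sum_i a_n * x_i * W[n, i, o].
    # Requires the full (N, I, O) weight tensor in memory.
    return np.einsum("n,i,nio->o", a, x, W)

def factorized_moe(x, a, A, B, C):
    # Same output, computed without ever forming W.
    # Assumes W[n, i, o] = sum_r A[n, r] * B[i, r] * C[o, r].
    return C @ ((a @ A) * (x @ B))

rng = np.random.default_rng(0)
N, I, O, R = 64, 16, 8, 4          # experts, input dim, output dim, rank (all illustrative)

A = rng.normal(size=(N, R))        # expert-mode factor
B = rng.normal(size=(I, R))        # input-mode factor
C = rng.normal(size=(O, R))        # output-mode factor
G = rng.normal(size=(I, N))        # hypothetical gating matrix

x = rng.normal(size=I)
a = softmax(x @ G)                 # differentiable expert coefficients (assumed softmax)

# Materialize W only to verify the factorized path against the dense one.
W = np.einsum("nr,ir,or->nio", A, B, C)
y_dense = dense_moe(x, a, W)
y_fact = factorized_moe(x, a, A, B, C)

dense_params = N * I * O
factor_params = R * (N + I + O)
print(np.allclose(y_dense, y_fact), dense_params, factor_params)
```

In this sketch the parameter count grows as R(N + I + O) rather than N·I·O, which is why the number of experts can be increased cheaply. Because every expert contributes through continuous coefficients, there is no discrete top-k routing step to differentiate through.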

Code & Models

Videos

Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization · SlidesLive

Taxonomy

Topics: Expert finding and Q&A systems

Methods: Layer Normalization · Average Pooling · Global Average Pooling · Residual Connection · Dropout · Dense Connections · MLP-Mixer · Linear Layer