Mixture of Experts with Soft Nearest Neighbor Loss: Resolving Expert Collapse via Representation Disentanglement
Abien Fred Agarap, Arnulfo P. Azcarraga

TL;DR
This paper introduces an improved Mixture-of-Experts model that uses Soft Nearest Neighbor Loss to prevent expert collapse, leading to more diverse experts and better classification accuracy.
Contribution
The authors propose a novel MoE architecture with a feature extractor trained using SNNL to promote representation disentanglement and expert diversity.
Findings
Enhanced MoE models show increased expert diversity.
The approach improves classification accuracy on multiple datasets.
Structural expert collapse is effectively mitigated.
Abstract
The Mixture-of-Experts (MoE) model uses a set of expert networks that specialize in subsets of a dataset under the supervision of a gating network. A common issue in MoE architectures is "expert collapse", where overlapping class boundaries in the raw input feature space cause multiple experts to learn redundant representations, thus forcing the gating network into rigid routing to compensate. We propose an enhanced MoE architecture that utilizes a feature extractor network optimized using Soft Nearest Neighbor Loss (SNNL) prior to feeding input features to the gating and expert networks. By pre-conditioning the latent space to minimize distances among class-similar data points, we resolve structural expert collapse, which results in experts learning highly orthogonal weights. We employ Expert Specialization Entropy and Pairwise Embedding Similarity to quantify this dynamic. We evaluate…
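To make the pre-conditioning step concrete, the following is a minimal sketch of a soft nearest neighbor loss over one batch of latent features. It assumes the commonly used formulation (negative log of the ratio between same-class and all-point similarity mass, with a squared Euclidean distance and a temperature); the function name, the `temperature` and `eps` parameters, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math


def soft_nearest_neighbor_loss(features, labels, temperature=1.0, eps=1e-9):
    """Average SNNL over a batch (hypothetical helper, standard formulation assumed).

    features: list of equal-length lists of floats (latent vectors)
    labels:   list of class labels, one per feature vector
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        same_class = 0.0  # similarity mass on neighbors sharing the label of i
        all_points = 0.0  # similarity mass on every point other than i
        for j in range(n):
            if i == j:
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            weight = math.exp(-sq_dist / temperature)
            all_points += weight
            if labels[i] == labels[j]:
                same_class += weight
        # eps keeps the logarithm and the division finite
        total += -math.log(eps + same_class / (eps + all_points))
    return total / n


# Toy batch: the same four points, labeled so that classes are either
# tightly clustered or interleaved across the two clusters.
labels = [0, 0, 1, 1]
clustered = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]]
entangled = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.0], [2.1, 2.0]]

loss_clustered = soft_nearest_neighbor_loss(clustered, labels)
loss_entangled = soft_nearest_neighbor_loss(entangled, labels)
print(loss_clustered < loss_entangled)
```

In this sketch a lower value means that each point's nearest neighbors mostly share its class, which is the condition the abstract describes as minimizing distances among class-similar data points before features reach the gating and expert networks.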
