TL;DR
This paper introduces the Adaptive Clustering (AC) router for Sparse Mixture-of-Experts models, which improves convergence, robustness, and performance by better identifying and matching latent input clusters.
Contribution
The paper proposes a novel AC router that adaptively weights features to improve clustering and token-expert matching in MoE models, yielding models that converge faster and are more robust and more accurate.
Findings
The AC router converges faster during MoE training.
The AC router is more robust to data corruption.
The AC router improves overall performance on language and image tasks.
Abstract
Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each…
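To make the idea of routing in an adaptively transformed space concrete, here is a minimal NumPy sketch. It is not the paper's method: the abstract does not state the derived optimal weights, so `feature_weights` substitutes a simple assumed heuristic (features with low within-cluster spread get more weight). The function names, the toy data, and the expert embeddings are all hypothetical and chosen only for illustration.

```python
import numpy as np


def feature_weights(tokens, assignments, num_clusters, eps=1e-6):
    """ASSUMED heuristic, not the paper's derivation: weight each feature
    by the inverse of its summed within-cluster standard deviation."""
    dim = tokens.shape[1]
    spread = np.zeros(dim)
    for k in range(num_clusters):
        members = tokens[assignments == k]
        if len(members) > 1:
            spread += members.std(axis=0)
    weights = 1.0 / (spread + eps)
    # Rescale so the average weight is 1 (keeps score magnitudes comparable).
    return weights / weights.sum() * dim


def route(tokens, expert_embeddings, weights=None, top_k=1):
    """Top-k routing by dot-product score, optionally after rescaling
    each feature of the token by `weights`."""
    if weights is None:
        weights = np.ones(tokens.shape[1])
    scores = (tokens * weights) @ expert_embeddings.T
    # Sort experts by descending score and keep the first top_k indices.
    return np.argsort(-scores, axis=1)[:, :top_k]


# Toy data: feature 0 separates the two clusters, feature 1 is large noise.
tokens = np.array([[1.0, 5.0], [1.0, -5.0], [-1.0, 5.0], [-1.0, -5.0]])
clusters = np.array([0, 0, 1, 1])
experts = np.array([[1.0, 0.5], [-1.0, -0.5]])  # hypothetical expert embeddings

plain = route(tokens, experts)[:, 0]
weights = feature_weights(tokens, clusters, num_clusters=2)
adaptive = route(tokens, experts, weights)[:, 0]
```

On this toy input the unweighted router follows the noisy feature and splits each cluster across both experts, while the weighted router sends each cluster to a single expert. The sketch uses known cluster labels to compute the weights, which is a simplification; a router in a real MoE layer would have to estimate them from its own assignments.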