Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts

Youngseog Chung; Dhruv Malik; Jeff Schneider; Yuanzhi Li; Aarti Singh

arXiv:2409.00879 · cs.LG · September 4, 2024

Open Access · 3 Reviews

TL;DR

This paper investigates the implicit biases of Soft Mixture of Experts models. It shows that a Soft MoE with a single expert cannot represent simple convex functions, and explores how multiple experts contribute to representation power and specialization.

Contribution

It provides theoretical insights into Soft MoE's limitations and introduces a method to identify specialized experts efficiently, enhancing understanding of expert collaboration.

Findings

01. Soft MoE with one expert cannot represent simple convex functions.

02. Multiple experts are necessary for sufficient representation power.

03. Empirical method to identify specialized experts efficiently.

Abstract

The traditional viewpoint on Sparse Mixture of Experts (MoE) models is that instead of training a single large expert, which is computationally expensive, we can train many small experts. The hope is that if the total parameter count of the small experts equals that of the singular large expert, then we retain the representation power of the large expert while gaining computational tractability and promoting expert specialization. The recently introduced Soft MoE replaces the Sparse MoE's discrete routing mechanism with a differentiable gating function that smoothly mixes tokens. While this smooth gating function successfully mitigates the various training instabilities associated with Sparse MoE, it is unclear whether it induces implicit biases that affect Soft MoE's representation power or potential for expert specialization. We prove that Soft MoE with a single arbitrarily powerful…

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The paper gives a good summary of the soft MoE model and sets up notation in a clear way. The notion of specialization is interesting. The proposed algorithm for selecting experts is simple and intuitive.

Weaknesses

Section 3 on the representation failure of a single expert is quite trivial and obvious. Each expert essentially sees X only through a single d-dimensional linear projection, and hence any function of X that uses all of X in a non-linear manner clearly cannot be represented by a single expert regardless of how complex the expert is. Multiple experts can potentially overcome this because they would each look at different projections. Section 4 results on specialization are also quite weak. Choosi
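The projection argument in this review can be illustrated with a small numerical sketch. It assumes the original Soft MoE layer with one expert and one slot, and it is not the paper's proof. The names `phi`, `expert`, and the sizes `m`, `d` are hypothetical choices made for illustration. Under these assumptions the combine weights collapse to ones, so every output token is the same function of a single weighted average of the input tokens.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def expert(s):
    # Hypothetical nonlinear expert; any function of the slot would do.
    return np.tanh(s) ** 3 + s

rng = np.random.default_rng(0)
m, d = 5, 3                      # m tokens of dimension d (illustrative sizes)
X = rng.normal(size=(m, d))
phi = rng.normal(size=(d, 1))    # gating parameters: one expert, one slot

logits = X @ phi                 # shape (m, 1)
D = softmax(logits, axis=0)      # dispatch weights over tokens, sum to 1
C = softmax(logits, axis=1)      # combine weights over a single slot: all ones
slot = D.T @ X                   # shape (1, d): the only view of X the expert gets

Y = C @ expert(slot)             # shape (m, d): every row is expert(slot)
```

However expressive `expert` is, it receives only the `d`-dimensional `slot`, which is the bottleneck the reviewer describes.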

Reviewer 02 · Rating 3 · Confidence 4

Strengths

The paper considers important and relevant questions regarding implicit biases in Soft MoEs, particularly regarding expert specialization. I find these questions relevant and interesting for the ML community.

Weaknesses

- **W1. SoftMoE Formulation**: The SoftMoE formulation presented in Section 2 of this paper differs from that of the original SoftMoE paper [1]. In the original formulation, each expert processes $p$ slots, making $\Phi \in \mathbb{R}^{d \times (n \cdot p)}$ and $C(X) \in \mathbb{R}^{m \times (n \cdot p)}$. However, in this paper, the parameter $p$ is not included. Notably, setting $p=1$ in the original formulation results in an incomparability with other papers using SoftMoE, where $p$ is reco
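To make the shapes in W1 concrete, here is a minimal sketch of a Soft MoE layer with $p$ slots per expert, following the dispatch/combine structure of the original formulation as described in the review. The function name `soft_moe`, the toy experts, and the sizes are assumptions for illustration, not code from either paper.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(X, Phi, experts, p):
    """X: (m, d) tokens; Phi: (d, n*p); experts: n callables mapping (p, d) -> (p, d)."""
    logits = X @ Phi                     # (m, n*p)
    D = softmax(logits, axis=0)          # dispatch: normalise over tokens, per slot
    C = softmax(logits, axis=1)          # combine: normalise over slots, per token
    slots = D.T @ X                      # (n*p, d): each slot is a convex mix of tokens
    outs = np.concatenate(
        [f(slots[i * p:(i + 1) * p]) for i, f in enumerate(experts)], axis=0
    )                                    # (n*p, d): expert i handles its own p slots
    return C @ outs                      # (m, d)

rng = np.random.default_rng(0)
m, d, n, p = 6, 4, 3, 2                  # illustrative sizes
X = rng.normal(size=(m, d))
Phi = rng.normal(size=(d, n * p))
experts = [np.tanh, np.sin, np.cos]      # toy stand-ins for learned experts
Y = soft_moe(X, Phi, experts, p)
```

Setting `p = 1` gives one slot per expert, so `Phi` has shape `(d, n)` and the combine matrix has shape `(m, n)`.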

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The originality lies in the paper’s creative approach, combining established concepts in expert modeling with a critical examination of representational limitations, as demonstrated by Theorem 1 and the empirical evidence in Appendix B.
- This work has practical significance for advancing the design of scalable MoE models. By addressing the limitations in single-expert representational capacity, the paper opens up discussions on how architectural adjustments, such as multiple experts or effi

Weaknesses

- **Clarity in Methodological Exposition:** Certain aspects of the methodology, such as the approach for selecting the number of experts $k$ for prediction in real applications, would benefit from more detailed and clearer explanations.
- **Clarity in Technical Proof Exposition:** Including a proof sketch at the beginning of Appendix A would enhance readability and provide a clearer roadmap for understanding the technical details. Additionally, a more detailed explanation of why the equalities

Taxonomy

Topics: Experimental Behavioral Economics Studies · Opinion Dynamics and Social Influence

Methods: Mixture of Experts