Statistical Advantages of Perturbing Cosine Router in Mixture of Experts
Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, Nhat Ho

TL;DR
This paper analyzes the statistical properties of the cosine router in Mixture of Experts, revealing slow convergence rates that can be improved by adding noise, and validates these findings through extensive simulations.
Contribution
It provides the first comprehensive theoretical analysis of the cosine router in MoE, demonstrating how noise perturbation improves estimation rates.
Findings
Without perturbation, estimation rates are as slow as O(1/ log^τ(n)).
Adding noise to the cosine router improves convergence to polynomial rates.
Simulation studies confirm the theoretical improvements in both synthetic and real data.
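To give a feel for the gap between the two kinds of rate in the findings above, the short sketch below computes how many samples each rate would need to reach a target error. The values τ = 1 and the polynomial exponent 1/2 are illustrative assumptions, not constants taken from the paper, and the function names are hypothetical.

```python
import math

def samples_for_log_rate(eps, tau=1.0):
    """Smallest n with 1 / log(n)**tau <= eps, i.e. n = exp(eps**(-1/tau)).

    tau = 1.0 is an illustrative assumption, not the paper's constant.
    """
    return math.exp(eps ** (-1.0 / tau))

def samples_for_poly_rate(eps, alpha=0.5):
    """Smallest n with n**(-alpha) <= eps, i.e. n = eps**(-1/alpha).

    alpha = 0.5 is an illustrative assumption, not the paper's exponent.
    """
    return eps ** (-1.0 / alpha)

if __name__ == "__main__":
    for eps in (0.5, 0.2, 0.1):
        n_log = samples_for_log_rate(eps)
        n_poly = samples_for_poly_rate(eps)
        print(f"eps={eps}: log-rate n ~ {n_log:.0f}, poly-rate n ~ {n_poly:.0f}")
```

Under these assumed constants, a target error of 0.1 needs about e^10 (roughly 22,000) samples at the logarithmic rate but only about 100 at the polynomial rate, and the gap widens quickly as the target shrinks.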
Abstract
The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical success, a comprehensive analysis of the cosine router in MoE has been lacking. Considering the least squares estimation of the cosine routing MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as O(1/log^τ(n)), where τ is some constant and n is the sample size.…
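The routing mechanism discussed in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cosine_router`, the bias term `b`, and the choice to perturb by adding small constants `tau1` and `tau2` to the two L2 norms in the denominator are assumptions made here, since the summary only says that noise is added to the cosine router.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cosine_router(x, W, b, tau1=0.0, tau2=0.0):
    """Gating weights from cosine similarity between inputs and router vectors.

    x: (n, d) inputs; W: (k, d) router vectors; b: (k,) biases.
    tau1, tau2: assumed perturbations added to the norms of W and x;
    tau1 = tau2 = 0 gives the plain cosine router.
    """
    x_norm = np.linalg.norm(x, axis=1, keepdims=True) + tau2  # shape (n, 1)
    w_norm = np.linalg.norm(W, axis=1) + tau1                 # shape (k,)
    scores = (x @ W.T) / (x_norm * w_norm) + b                # shape (n, k)
    return softmax(scores)

# Hypothetical toy data: 5 inputs of dimension 4, routed among 3 experts.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W = rng.normal(size=(3, 4))
b = np.zeros(3)

plain = cosine_router(x, W, b)
plain_rescaled = cosine_router(x, 3.0 * W, b)
perturbed = cosine_router(x, W, b, tau1=0.1, tau2=0.1)
perturbed_rescaled = cosine_router(x, 3.0 * W, b, tau1=0.1, tau2=0.1)
```

In this sketch the plain router returns identical gates for `W` and `3.0 * W`, because normalization cancels the scale of each router vector, while the perturbed version does not. That scale invariance may be one way to picture the parameter interaction the abstract refers to, though the paper's formal argument goes through partial differential equations.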
Peer Reviews
Decision: ICLR 2025 Poster
Originality: The paper provides the first theoretical study of the cosine router MoE and its perturbed version and confirms the theoretical advantage of the latter, which can be considered novel. Quality: The authors address some technical challenges to complete the theoretical study. For example, the normalization of the cosine router introduces more sophisticated parameter interactions among the elements of the router parameters, which need to be addressed. Clarity: The paper is we
1. The scope of the theoretical result is limited, as it requires that the ground truth is itself generated by a cosine routing MoE. It is not clear if practical datasets meet this assumption. Could you please discuss the implications of your results for real-world datasets that may not perfectly match the theoretical assumptions? Is there any intuition for how the findings might generalize to more realistic settings? 2. In the numerical experiments in Section 5.1, the ground truth is generated by an
Routing schemes such as the cosine router are very important components in modern machine learning models, but there has been little theoretical study. This paper presents a compelling theoretical picture that a perturbed cosine router should have better rates than the standard cosine router. The experiments are also extensive and seem quite convincing. Finally, the paper is generally quite well-written.
My main concern is that some of the most important theoretical conclusions of the paper are not formally stated, let alone proven: the claims directly after Theorem 2 about how the theorem implies slow rates for parameter estimation and expert estimation. I (roughly speaking) believe the intuition for (i), though I have some questions -- see below -- but I completely do not understand the intuition for (ii). The claim is that since the parameters \eta are hard to estimate, and h(\eta) is Lipschi
The paper is clearly and logically presented, making it easy to follow the methodology and findings. The experimental results validate the theoretical analysis and demonstrate the effectiveness of the proposed method, providing solid support for the authors' claims.
As I am not familiar with the statistical techniques used in this field, I found it challenging to fully assess the novelty of this paper compared to prior work. Specifically, in my literature review, I noted that [1] appears closely related to this study. Could the authors clarify any key differences or advancements offered by their approach compared to [1]? Additionally, as noted by [2], the sparse Mixture of Experts (MoE) approach appears to offer improved generalization capabilities. [1] Ng
Taxonomy
Topics: Complex Network Analysis Techniques · Human Mobility and Location-Based Analysis · Bayesian Methods and Mixture Models
Methods: Mixture of Experts
