Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

Lean Wang; Huazuo Gao; Chenggang Zhao; Xu Sun; Damai Dai

arXiv:2408.15664·cs.LG·August 29, 2024·6 cites

Open Access · 3 Reviews

TL;DR

This paper introduces Loss-Free Balancing, a novel load balancing strategy for Mixture-of-Experts models that maintains expert load balance without auxiliary loss, improving performance and efficiency.

Contribution

The paper proposes a new load balancing method for MoE models that avoids auxiliary loss and interference gradients, enhancing model performance and load distribution.

Findings

01 Achieves better load balance than auxiliary-loss-based balancing methods.

02 Improves model performance by avoiding the interference gradients that an auxiliary loss introduces.

03 Validated on models with up to 3B parameters trained on up to 200B tokens.

Abstract

For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, featured by an auxiliary-loss-free load balancing strategy. To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any…
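The routing step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the abstract excerpt does not state the exact bias update rule, so the fixed-step rule here (raise the bias of under-loaded experts, lower it for over-loaded ones) is an assumption, as are the function names `route_top_k`, `update_bias`, and `simulate`, the update `rate`, and the synthetic skewed scores used to mimic an unbalanced router.

```python
import random


def route_top_k(scores, bias, k):
    """Select the k experts with the highest (score + bias) for one token."""
    biased = [s + b for s, b in zip(scores, bias)]
    ranked = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)
    return ranked[:k]


def update_bias(bias, load, rate):
    """Assumed rule: move each bias by a fixed step against its load error."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, c in zip(bias, load):
        if c < mean_load:      # under-loaded expert: make it more attractive
            new_bias.append(b + rate)
        elif c > mean_load:    # over-loaded expert: make it less attractive
            new_bias.append(b - rate)
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_experts=4, k=2, tokens_per_batch=256, steps=200,
             rate=0.01, seed=0):
    """Route synthetic batches and return the first and last per-expert loads."""
    rng = random.Random(seed)
    # Hypothetical skew: expert 0 systematically receives higher scores.
    skew = [0.3 if i == 0 else 0.0 for i in range(num_experts)]
    bias = [0.0] * num_experts
    first_load, last_load = None, None
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_batch):
            scores = [rng.random() + skew[i] for i in range(num_experts)]
            for expert in route_top_k(scores, bias, k):
                load[expert] += 1
        if first_load is None:
            first_load = load
        last_load = load
        # The bias is adjusted from the observed load; no loss term is involved.
        bias = update_bias(bias, load, rate)
    return first_load, last_load


if __name__ == "__main__":
    first, last = simulate()
    print("load in first batch:", first)
    print("load in last batch: ", last)
```

In this sketch the bias only affects which experts are selected and is adjusted outside of any gradient computation, which mirrors the abstract's point that balance is controlled without adding gradients to training.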

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

- The paper is well-written and the method is clearly explained
- Good visualizations
- Simple approach that should be very easy to test and cheaper to compute than the conventionally used load-balancing loss

Weaknesses

- I am concerned about the validity of the claims. The empirical evaluations are very limited, constrained to two DeepSeekMoE models, and the perplexity differences between the models and the baselines are at the level of 0.05. Is this difference in perplexity significant?
- The evaluation is limited to language modelling and perplexity values. It would be better to see the actual effect of the loss-free load balancing on other downstream tasks such as MMLU or GLUE.
- The proposed Max Violat

Reviewer 02 · Rating 5 · Confidence 4

Strengths

This paper has the following strengths:
1. The proposed Loss-Free Balancing method eliminates the need for auxiliary loss, which traditionally adds undesirable interference gradients. This results in a cleaner training signal focused solely on the primary language modeling objective, potentially enhancing overall model performance.
2. By dynamically adjusting biases for each expert based on recent load data, the method ensures a balanced expert load without compromising model efficiency.
3. T

Weaknesses

This paper has the following weaknesses:
1. I am at first astonished by the short reference list of this paper, as the authors only cited 10 papers. Clearly, this paper did a very bad job of surveying the related work, including the various auxiliary-loss-based balancing methods, the major improvements of MoEs, and the current MoE-based LLMs. Normally, I would list a few of the works for your reference, but the authors missed too many so I do not know where to start. I would strongly suggest the au

Reviewer 03 · Rating 5 · Confidence 2

Strengths

This is an interesting research problem, and the authors aim to develop an efficient solution approach.

Weaknesses

1. The motivation and underlying intuition for the proposed approach could be clarified further to enhance understanding.
2. Additional experiments are recommended to demonstrate the robustness of this approach when applied across varying numbers of expert mixtures. A scalable analysis would also be beneficial.
3. The approach would be strengthened with theoretical justification to substantiate its effectiveness.

Taxonomy

Topics: Distributed Sensor Networks and Detection Algorithms · Mobile Crowdsensing and Crowdsourcing · Privacy-Preserving Technologies in Data

Methods: Mixture of Experts