Sparse Backpropagation for MoE Training
Liyuan Liu, Jianfeng Gao, Weizhu Chen

TL;DR
SparseMixer introduces a scalable gradient estimator for MoE models that enables reliable backpropagation with sparse expert routing, significantly improving training efficiency and convergence.
Contribution
It presents SparseMixer, a novel ODE-based gradient estimator that approximates dense gradients in sparse MoE training, enhancing scalability and performance.
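To make the dense-versus-sparse distinction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not SparseMixer itself: the scalar `CountingExpert` class, the example logits, and the function names are all hypothetical, and the top-1 rule with gate-probability scaling is a simplified version of a Switch-style layer.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class CountingExpert:
    """Hypothetical scalar 'expert' that records how often it is evaluated."""
    def __init__(self, weight):
        self.weight = weight
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.weight * x

def dense_mixture(x, logits, experts):
    # Dense computation: every expert runs; outputs are weighted by gate probabilities.
    probs = softmax(logits)
    return sum(p * expert(x) for p, expert in zip(probs, experts))

def top1_routing(x, logits, experts):
    # Sparse computation: only the highest-scoring expert runs,
    # and its output is scaled by its gate probability.
    probs = softmax(logits)
    k = max(range(len(probs)), key=probs.__getitem__)
    return probs[k] * experts[k](x), k

# Illustrative values only (assumptions, not taken from the paper).
experts = [CountingExpert(w) for w in (0.5, -1.0, 2.0, 1.5)]
logits = [0.1, 0.3, 1.2, -0.4]

y_sparse, chosen = top1_routing(3.0, logits, experts)
sparse_calls = sum(e.calls for e in experts)

y_dense = dense_mixture(3.0, logits, experts)
dense_calls = sum(e.calls for e in experts) - sparse_calls

print(f"chosen expert: {chosen}, sparse calls: {sparse_calls}, dense calls: {dense_calls}")
```

The sparse path touches one expert per input while the dense path touches all of them. That is why a gradient estimator that needs only the selected expert's output keeps the cost of sparse routing.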
Findings
Accelerates training convergence by up to 2x
Provides reliable gradient estimates in sparse MoE models
Improves performance on Switch Transformer tasks
Abstract
One defining characteristic of Mixture-of-Experts (MoE) models is their capacity for conducting sparse computation via expert routing, leading to remarkable scalability. However, backpropagation, the cornerstone of deep learning, requires dense computation, thereby posing challenges in MoE gradient computations. Here, we introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing. Unlike typical MoE training, which strategically neglects certain gradient terms for the sake of sparse computation and scalability, SparseMixer provides scalable gradient approximations for these terms, enabling reliable gradient estimation in MoE training. Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations with negligible computational overhead.…
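The abstract's appeal to a second-order solver can be illustrated with a generic numerical sketch. The Python below is not the paper's estimator. It only compares two ways of approximating f(b) - f(a) from a single derivative evaluation: a forward-Euler slope taken at the start point (first-order) and a mid-point slope (second-order). The toy function f(x) = exp(x) and all function names are assumptions chosen for illustration.

```python
import math

def euler_increment(f_prime, a, b):
    """First-order estimate of f(b) - f(a): slope evaluated at the start point a."""
    return f_prime(a) * (b - a)

def midpoint_increment(f_prime, a, b):
    """Second-order estimate of f(b) - f(a): slope evaluated at the midpoint of [a, b]."""
    return f_prime((a + b) / 2.0) * (b - a)

def errors(h):
    """Absolute one-step errors (euler, midpoint) for f(x) = exp(x) on [0, h]."""
    a, b = 0.0, h
    exact = math.exp(b) - math.exp(a)
    # For f(x) = exp(x), the derivative f'(x) is also exp(x).
    euler_err = abs(euler_increment(math.exp, a, b) - exact)
    midpoint_err = abs(midpoint_increment(math.exp, a, b) - exact)
    return euler_err, midpoint_err

for h in (0.4, 0.2, 0.1):
    e_err, m_err = errors(h)
    print(f"h={h:.1f}  euler_err={e_err:.2e}  midpoint_err={m_err:.2e}")
```

Both estimates cost one derivative evaluation, yet halving the step shrinks the Euler error by roughly 4x and the mid-point error by roughly 8x, matching their one-step error orders of O(h^2) and O(h^3).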
Peer Reviews
Decision · Submitted to ICLR 2024
(1) This paper discusses a very important research question in MoE, the backpropagation issue of the routing function, which researchers can easily overlook unless it is pointed out specifically. This research question is timely and important. (2) The first-order approximation used in this paper requires the output of only one expert, so it does not sacrifice scalability. (3) SparseMixer does not require the Hessian or other second-order derivatives, so its computational overhead is negligible.
(1) While SparseMixer achieves consistent improvement over the vanilla Switch Transformer, the improvement shown in Table 1 looks somewhat marginal. In particular, as the number of experts increases, the performance gains become more marginal. My conjecture is that the main evaluation task in the paper, GLUE, is too simple to demonstrate the empirical benefits of SparseMixer. I would like to see results on more challenging tasks, where the performance gains of S+S might be larger. (2) The ma
1. The idea of improving gradient computation at scale to improve MoE training is novel to me. 2. The paper consistently demonstrates the impact of neglecting $\Delta_0$ in pre-training with MoE. 3. The paper is well written and the results back up the improvement.
1. Straight-Through (ST) -> straight-through estimator (STE)? 2. Please define ODE in the abstract before using the abbreviation. 3. Please introduce definitions of $\Delta_0$ and $\Delta_1$. 4. It is not quite clear why the training speed improves. 5. Please demonstrate results with other MoE gating methods, as a few of them have tried to improve MoE training.
1. The pre-training of LLMs is costly, and MoE is one promising sparse training method to reduce the training overhead. The paper makes some contributions to improving backpropagation in MoE training. 2. The experimental results show the effectiveness of the proposed method on a specific model.
My concerns cover two aspects: 1. The experiments are somewhat weak, since there are other MoE architectures and other base models, while the authors focus only on a simplified setting of the Switch Transformer layer (Fedus et al., 2021). There are also other popular MoE architectures, e.g. (Shazeer et al., 2017; Lepikhin et al., 2020; Lewis et al., 2021) as mentioned in the paper, and (Yanqi et al., 2022; Nan et al., 2022) in the fo
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and ELM · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Adam · Switch FFN · Residual Connection · Switch Transformer
