Taming Sparsely Activated Transformer with Stochastic Experts
Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Tuo Zhao, Jianfeng Gao

TL;DR
This paper introduces THOR, an expert-based Transformer that activates experts at random for each input during both training and inference. Compared with traditional gated MoE models, it is more parameter-efficient and performs better on machine translation tasks.
Contribution
The paper proposes a novel stochastic expert activation method in Transformers, demonstrating improved efficiency and performance over existing MoE models.
Findings
THOR outperforms both the Transformer and MoE models on translation tasks.
THOR achieves comparable BLEU scores to larger MoE models with fewer parameters.
Random expert activation with consistency regularization enhances model performance.
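The last finding combines two ideas: choosing an expert at random instead of through a learned gate, and penalizing disagreement between predictions made with different randomly chosen experts. The sketch below is a toy illustration only, not the authors' implementation. It assumes a symmetric KL divergence as the consistency term, two forward passes per input, and a weight `alpha`; the helper names (`make_expert`, `stochastic_forward`, `consistency_loss`, `thor_style_loss`) are hypothetical, and each "expert" is reduced to a single linear map.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def make_expert(rng, d_in, d_out):
    # Hypothetical toy expert: one linear map stored as a d_out x d_in matrix.
    return [[rng.gauss(0.0, 0.5) for _ in range(d_in)] for _ in range(d_out)]


def expert_forward(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def stochastic_forward(experts, x, rng):
    # Pick one expert uniformly at random; no learned gate is involved.
    idx = rng.randrange(len(experts))
    return idx, softmax(expert_forward(experts[idx], x))


def consistency_loss(p1, p2):
    # Assumed form: symmetric KL between the two predicted distributions.
    return 0.5 * (kl(p1, p2) + kl(p2, p1))


def thor_style_loss(experts, x, target, rng, alpha=1.0):
    # Two passes over the same input, each with its own random expert.
    _, p1 = stochastic_forward(experts, x, rng)
    _, p2 = stochastic_forward(experts, x, rng)
    cross_entropy = -(math.log(p1[target]) + math.log(p2[target]))
    return cross_entropy + alpha * consistency_loss(p1, p2)


if __name__ == "__main__":
    rng = random.Random(0)
    experts = [make_expert(rng, d_in=4, d_out=3) for _ in range(4)]
    x = [1.0, -0.5, 0.25, 2.0]
    print("loss:", thor_style_loss(experts, x, target=1, rng=rng))
```

When both passes happen to pick the same expert, the consistency term is zero and only the cross-entropy remains. When they pick different experts, the term pushes those experts toward agreeing on the same input, which is one plausible reading of why random activation can work at inference time without a router.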
Abstract
Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to an outrageously large number of parameters without a significant increase in computational cost. However, SAMs are reported to be parameter inefficient, in that larger models do not always lead to better performance. While most ongoing research focuses on improving SAMs by exploring methods of routing inputs to experts, our analysis reveals that such research might not lead to the solution we expect: the commonly used routing methods based on gating mechanisms do not work better than randomly routing inputs to experts. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts). Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference. THOR models are…
Taxonomy
Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Switch FFN · Residual Connection · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Switch Transformer · Label Smoothing
