From Sparse to Soft Mixtures of Experts
Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby

TL;DR
This paper introduces Soft MoE, a differentiable sparse Transformer that improves upon traditional MoEs by enabling implicit soft token assignment, leading to better scalability, stability, and performance in visual recognition tasks.
Contribution
The paper proposes Soft MoE, a fully-differentiable sparse Transformer that addresses key MoE challenges while maintaining scalability and performance benefits.
Findings
Soft MoE outperforms dense Transformers and other MoEs in visual recognition.
Soft MoE scales efficiently with over 40x more parameters at minimal inference cost.
Soft MoE demonstrates significant performance improvements in large-scale models.
Abstract
Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoEs, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity (and performance) at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms dense Transformers (ViTs) and popular MoEs (Tokens Choice and Experts Choice). Furthermore, Soft MoE…
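The soft assignment described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes that the per-slot parameters `phi`, the helper names, and the toy experts are hypothetical, that dispatch weights are a softmax over tokens and combine weights a softmax over slots, and that any input normalization is omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def soft_moe_layer(tokens, phi, experts, slots_per_expert=1):
    """Sketch of one Soft MoE layer (assumed form).

    tokens:  m x d input tokens
    phi:     d x s slot parameters, with s = len(experts) * slots_per_expert
    experts: list of callables mapping a d-vector to a d-vector
    """
    logits = matmul(tokens, phi)  # m x s token-slot affinities
    # Dispatch weights: softmax over tokens, so each slot (column) sums to 1.
    dispatch = transpose([softmax(col) for col in transpose(logits)])
    # Combine weights: softmax over slots, so each token (row) sums to 1.
    combine = [softmax(row) for row in logits]
    # Each slot receives a weighted combination of ALL input tokens.
    slot_inputs = matmul(transpose(dispatch), tokens)  # s x d
    # Each expert processes only its own slots, not every token.
    slot_outputs = [experts[i // slots_per_expert](slot)
                    for i, slot in enumerate(slot_inputs)]
    # Each output token is a weighted combination of all slot outputs.
    outputs = matmul(combine, slot_outputs)  # m x d
    return outputs, dispatch, combine

# Toy usage: 4 tokens of width 3, 2 hypothetical experts with 1 slot each.
random.seed(0)
m, d, n = 4, 3, 2
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
phi = [[random.gauss(0, 1) for _ in range(n)] for _ in range(d)]
experts = [lambda v: [2.0 * x for x in v], lambda v: [x + 1.0 for x in v]]
outputs, dispatch, combine = soft_moe_layer(tokens, phi, experts)
```

Because every step is a softmax or a matrix product, there is no discrete routing decision and no token is dropped, which is the sense in which the layer is fully differentiable while each expert still sees only a small number of slots.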
Peer Reviews
Decision: ICLR 2024 spotlight
- The paper is well written, easy to understand, and comes with source code showing how Soft MoE is implemented.
- Experiments are comprehensive and detailed within its domain, with emphasis not only on performance but also on inference and wall-clock time.
- High potential to bring improvements when deployed on other modalities, such as large language models (LLMs).
Aside from the weaknesses mentioned in the paper, I would like to raise some further concerns:
- The experiments in Sections 3 and 4 focus only on vision-related tasks. It would be great to see results on other modalities, such as NLP tasks based on the GLUE or SuperGLUE benchmarks, as performed in [1].
- Unfortunately, the dataset used for training is not publicly available, making it hard to be used for benchmarking w
Originality: The idea of Soft MoE is new; I have not encountered previous work that uses it. It combines sparse MoE with an attention-like mechanism to get the benefits of both. What I like most is that this combination is very practical for large-scale models that require model parallelism. The authors also clearly contrast it with existing work and with multi-head attention in Section 5.
Quality: The quality of the paper is high. - The author pr
As the authors already mention, the main weakness of this method, from my perspective, is that each expert does not handle multiple tokens well (i.e., one token per expert is better). In practice, this may cause an inefficient increase in the number of parameters (memory). But the authors have already recognized this, and I don't think it hurts the significance of the paper's contribution.
The paper makes a clear contribution of an architectural improvement that requires extensive experimental justification to prove. It then provides the experimental results to back up this assertion on expensive benchmarks. The results are presented clearly and it is easy for the reader to find relevant information.
Minor presentation issues:
- Figure 2 is confusing: the relationship between the dispatch and combine weights and the tokens is illustrated with two downward arrows, but the arrows don't carry any clear meaning, so they don't help the reader.
- Better signposting is needed for the different results found in the experiments section, as it is the most valuable part of the paper. For example, in the introduction some of the inference time and training time benefits are mentioned but not where these results
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Attention Dropout · Weight Decay · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax
