Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights
Jakub Krajewski, Marcin Chochowski, Daniel Korzekwa

TL;DR
This paper empirically evaluates fine-grained Mixture of Experts (MoE) architectures for large language models with up to 56B total (17B active) parameters, showing improved performance over standard MoE configurations and offering practical training insights.
Contribution
It provides a comprehensive empirical comparison of fine-grained MoE versus standard MoE, including training recipes and insights for scaling large models.
Findings
At the largest scale, fine-grained MoE achieves better validation loss than standard MoE.
It reaches higher accuracy across a set of downstream benchmarks.
It shows improved convergence speed at large scale.
Abstract
Mixture of Experts (MoE) architectures have emerged as pivotal for scaling Large Language Models (LLMs) efficiently. Fine-grained MoE approaches - utilizing more numerous, smaller experts - have demonstrated potential in improving model convergence and quality. This work proposes a set of training recipes and provides a comprehensive empirical evaluation of fine-grained MoE, directly comparing its scaling properties against standard MoE configurations for models with up to 56B total (17B active) parameters. We investigate convergence speed, model performance on downstream benchmarks, and practical training considerations across various setups. Overall, at the largest scale we show that fine-grained MoE achieves better validation loss and higher accuracy across a set of downstream benchmarks. This study offers empirical grounding and practical insights for leveraging fine-grained MoE in…
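To make "more numerous, smaller experts" concrete, here is a minimal sketch of the parameter accounting for a single MoE layer. It assumes a common definition of granularity that the abstract does not spell out: with granularity G, each expert's hidden size is divided by G, while the number of experts and the number of experts routed per token are both multiplied by G. The function name `moe_layer_stats`, the two-matrix feed-forward expert, and all dimensions are illustrative assumptions, not the paper's configuration.

```python
from math import comb


def moe_layer_stats(d_model, d_ff, n_experts, top_k, granularity=1):
    """Parameter accounting for one MoE layer (illustrative sketch).

    Assumptions (not taken from the paper): each expert is a plain
    two-matrix feed-forward block, and granularity G shrinks each
    expert's hidden size by G while multiplying both the expert count
    and the routed top-k by G.
    """
    if d_ff % granularity != 0:
        raise ValueError("d_ff must be divisible by granularity")

    expert_hidden = d_ff // granularity       # smaller experts
    num_experts = n_experts * granularity     # more numerous experts
    active_experts = top_k * granularity      # more experts routed per token
    params_per_expert = 2 * d_model * expert_hidden  # up- and down-projection

    return {
        "num_experts": num_experts,
        "active_experts": active_experts,
        "total_expert_params": num_experts * params_per_expert,
        "active_expert_params": active_experts * params_per_expert,
        "router_params": d_model * num_experts,  # one logit per expert
        "expert_combinations": comb(num_experts, active_experts),
    }


# Hypothetical dimensions, chosen only for illustration.
standard = moe_layer_stats(d_model=1024, d_ff=4096, n_experts=8, top_k=2, granularity=1)
fine = moe_layer_stats(d_model=1024, d_ff=4096, n_experts=8, top_k=2, granularity=4)

for name, stats in (("standard", standard), ("fine-grained", fine)):
    print(name, stats)
```

Under these assumptions the total and active expert parameter counts are identical for both settings; what changes is the number of experts, the size of the router, and the number of expert combinations a token can be routed to.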
