Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights

Jakub Krajewski; Marcin Chochowski; Daniel Korzekwa

arXiv:2506.02890·cs.LG·June 4, 2025

Open Access

TL;DR

This paper empirically evaluates fine-grained Mixture of Experts (MoE) architectures for large language models with up to 56B total (17B active) parameters. At the largest scale, fine-grained MoE reaches better validation loss and higher downstream accuracy than standard MoE configurations, and the paper offers practical training insights.

Contribution

It provides a comprehensive empirical comparison of fine-grained MoE versus standard MoE, including training recipes and insights for scaling large models.

Findings

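01. Fine-grained MoE achieves better validation loss than standard MoE at the largest scale evaluated.

02. Fine-grained MoE achieves higher accuracy across a set of downstream benchmarks.

03. Fine-grained MoE shows improved convergence speed at large scale.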

Abstract

Mixture of Experts (MoE) architectures have emerged as pivotal for scaling Large Language Models (LLMs) efficiently. Fine-grained MoE approaches, which use more numerous, smaller experts, have demonstrated potential in improving model convergence and quality. This work proposes a set of training recipes and provides a comprehensive empirical evaluation of fine-grained MoE, directly comparing its scaling properties against standard MoE configurations for models with up to 56B total (17B active) parameters. We investigate convergence speed, model performance on downstream benchmarks, and practical training considerations across various setups. Overall, at the largest scale we show that fine-grained MoE achieves better validation loss and higher accuracy across a set of downstream benchmarks. This study offers empirical grounding and practical insights for leveraging fine-grained MoE in…
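
To make "more numerous, smaller experts" concrete, the sketch below shows one common way to describe granularity: each expert's hidden size is divided by a factor G while the number of experts and the number of experts activated per token are both multiplied by G, so total and active expert parameters stay the same. This is an illustration under assumptions, not the paper's recipe: the class `MoEConfig`, the helper `fine_grained`, and all sizes are hypothetical, and the router, attention, and embedding parameters are ignored.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical description of a single MoE feed-forward layer."""
    d_model: int      # model (residual stream) width
    d_ff: int         # hidden width of each expert
    num_experts: int  # experts in the layer
    top_k: int        # experts activated per token

    def expert_params(self) -> int:
        # Assumes a two-matrix expert (up and down projection), no biases.
        return 2 * self.d_model * self.d_ff

    def total_params(self) -> int:
        # Expert parameters only; router weights are not counted.
        return self.num_experts * self.expert_params()

    def active_params(self) -> int:
        # Expert parameters used for one token.
        return self.top_k * self.expert_params()

    def routing_combinations(self) -> int:
        # Number of distinct expert subsets a token can be routed to.
        return comb(self.num_experts, self.top_k)


def fine_grained(cfg: MoEConfig, granularity: int) -> MoEConfig:
    """Split each expert into `granularity` smaller ones, scaling top_k to match."""
    if granularity < 1 or cfg.d_ff % granularity != 0:
        raise ValueError("granularity must be a positive divisor of d_ff")
    return MoEConfig(
        d_model=cfg.d_model,
        d_ff=cfg.d_ff // granularity,
        num_experts=cfg.num_experts * granularity,
        top_k=cfg.top_k * granularity,
    )


# Illustrative sizes only; these are not configurations from the paper.
standard = MoEConfig(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
fine = fine_grained(standard, granularity=4)

for name, cfg in (("standard", standard), ("fine-grained", fine)):
    print(name, cfg.num_experts, cfg.top_k, cfg.total_params(), cfg.active_params())
```

Under these assumptions the two layers have identical total and active expert parameter counts, while the fine-grained layer gives the router many more expert subsets to choose from. In a real model the router itself grows with the number of experts, which this sketch leaves out.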

Taxonomy

Topics: Advanced Photocatalysis Techniques · Catalysis and Hydrodesulfurization Studies · Advancements in Battery Materials