FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
Hao Kang, Zichun Yu, Chenyan Xiong

TL;DR
FLAME-MoE is an open-source, end-to-end research platform for Mixture-of-Experts language models. It supports detailed analysis and reproducible experiments on modern MoE architectures, and its models improve average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs.
Contribution
The paper introduces a fully open-source MoE research suite, releasing all training data pipelines, scripts, logs, and checkpoints, to support investigation of expert specialization, routing behavior, and scaling in language models.
Findings
Experts specialize on distinct token subsets.
Expert co-activation matrices remain sparse, indicating diverse expert usage.
Routing behavior stabilizes early in training.
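To make the sparsity and stability findings more concrete, here is a minimal sketch of two measurements one could compute from logged routing decisions. It assumes a co-activation entry C[i][j] is the fraction of tokens routed to expert i that were also routed to expert j, and it uses mean Jaccard overlap between checkpoints as a stand-in for routing stability. Both definitions, and the function names `coactivation` and `routing_overlap`, are illustrative assumptions rather than the paper's own metrics or code.

```python
from itertools import combinations


def coactivation(assignments, num_experts):
    """Build a co-activation matrix from per-token expert assignments.

    assignments: list of per-token lists of expert ids.
    Returns C where C[i][j] is the fraction of tokens routed to expert i
    that were also routed to expert j (0.0 if expert i never fires).
    This definition is an assumption made for illustration.
    """
    counts = [0] * num_experts
    pair = [[0] * num_experts for _ in range(num_experts)]
    for experts in assignments:
        uniq = sorted(set(experts))
        for i in uniq:
            counts[i] += 1
        for i, j in combinations(uniq, 2):
            pair[i][j] += 1
            pair[j][i] += 1
    return [
        [pair[i][j] / counts[i] if counts[i] else 0.0 for j in range(num_experts)]
        for i in range(num_experts)
    ]


def routing_overlap(a, b):
    """Mean Jaccard overlap between per-token expert sets at two checkpoints.

    a, b: routing assignments for the same tokens, in the same order.
    A value near 1.0 means the two checkpoints route tokens almost identically.
    """
    scores = []
    for ea, eb in zip(a, b):
        sa, sb = set(ea), set(eb)
        union = sa | sb
        scores.append(len(sa & sb) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Toy example: 4 tokens, 3 experts, top-2 routing.
toy = [[0, 1], [0, 2], [1, 2], [0, 1]]
C = coactivation(toy, num_experts=3)
```

Under these assumptions, a sparse matrix has most off-diagonal entries near zero, and early stabilization would show up as `routing_overlap` between an early checkpoint and the final one approaching 1.0 well before training ends.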
Abstract
Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture (64 experts with top-8 gating and 2 shared experts) closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging…
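The layer shape described in the abstract (64 experts, top-8 gating, 2 shared experts) can be illustrated with a toy sketch for a single token. The gating rule used here, softmax over all router logits followed by renormalization over the top 8, is an assumption, as are the scalar "experts" and the names `route_token` and `moe_layer`. The released FLAME-MoE code may gate and combine outputs differently.

```python
import math
import random

NUM_EXPERTS = 64  # routed experts, from the abstract
TOP_K = 8         # routed experts activated per token, from the abstract
NUM_SHARED = 2    # shared experts, from the abstract


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_ids, weights) for one token.

    Assumption: softmax over all experts, keep the top_k, renormalize
    so the kept weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


def moe_layer(x, router_logits, routed_experts, shared_experts):
    """Sum the always-on shared experts with the weighted top-k routed experts."""
    out = sum(f(x) for f in shared_experts)
    ids, weights = route_token(router_logits)
    for i, w in zip(ids, weights):
        out += w * routed_experts[i](x)
    return out


# Toy usage: scalar input, each hypothetical routed expert scales by its index.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routed = [lambda x, s=i: s * x for i in range(NUM_EXPERTS)]
shared = [lambda x: x for _ in range(NUM_SHARED)]
ids, weights = route_token(logits)
y = moe_layer(1.0, logits, routed, shared)
```

In this sketch only 8 of the 64 routed experts run for a given token, alongside the 2 shared experts, which is the sense in which the abstract says only a fraction of the model is active per token.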
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods
Methods: Mixture of Experts · ADaptive gradient method with the OPTimal convergence rate
