FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

Hao Kang; Zichun Yu; Chenyan Xiong

arXiv:2505.20225·cs.CL·May 27, 2025

Open Access · 1 Repo · 7 Models · 1 Dataset

TL;DR

FLAME-MoE is an open-source, end-to-end research platform for Mixture-of-Experts (MoE) language models. It supports reproducible experiments and detailed analysis on an architecture modeled on modern production MoE models, and it improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs.

Contribution

The paper introduces a fully open-source MoE research suite, with training data pipelines, scripts, logs, and checkpoints all publicly released, to support investigation of expert specialization, routing behavior, and scaling in language models.

Findings

01 Experts specialize on distinct token subsets.

02 Expert co-activation matrices remain sparse, indicating diverse expert usage.

03 Routing behavior stabilizes early in training.
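
The sparsity finding can be probed by counting how often pairs of experts are selected for the same token. The sketch below is a minimal illustration, not the authors' analysis code: the trace format (one list of expert ids per token), the function names, and the zero-count sparsity threshold are all assumptions, and the paper may normalize its matrices differently.

```python
from itertools import combinations


def coactivation_counts(routing_trace, num_experts):
    """Count how often each expert pair is selected for the same token.

    routing_trace: list of per-token lists of expert ids (hypothetical format).
    Returns a symmetric num_experts x num_experts matrix of raw counts.
    """
    counts = [[0] * num_experts for _ in range(num_experts)]
    for experts in routing_trace:
        # Each unordered pair of distinct experts on this token co-activates once.
        for a, b in combinations(sorted(set(experts)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def sparsity(matrix, threshold=0):
    """Fraction of off-diagonal entries at or below threshold (assumed metric)."""
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(1 for v in off_diag if v <= threshold) / len(off_diag)


# Toy trace with made-up numbers: 4 tokens, 6 experts, top-2 routing.
toy_trace = [[0, 1], [0, 1], [2, 3], [4, 5]]
toy_counts = coactivation_counts(toy_trace, num_experts=6)
toy_sparsity = sparsity(toy_counts)
```

In the toy trace only three expert pairs ever fire together, so most off-diagonal entries stay at zero. A real analysis would run the same counting over routing traces from a trained checkpoint.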

Abstract

Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging…
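
The routing setup named in the abstract (64 experts, top-8 gating, 2 shared experts) can be illustrated with a small sketch. It is only an assumption-laden outline: the softmax-then-top-k order, the renormalization of the kept weights, and the names `softmax` and `route_token` are illustrative choices, not details confirmed by the abstract. Shared experts are taken to bypass the router and run on every token.

```python
import math
import random

NUM_EXPERTS = 64  # routed experts per MoE layer (from the abstract)
TOP_K = 8         # experts selected per token by the gate (from the abstract)
NUM_SHARED = 2    # shared experts, assumed to run for every token


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_ids, gate_weights) for one token.

    Assumption: softmax over all routed experts, keep the top_k,
    then renormalize the kept weights so they sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / kept for i in chosen]


# Random logits stand in for a learned router's output on one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
expert_ids, gate_weights = route_token(logits)
```

Only the selected experts would run their feed-forward computation for this token, which is why active parameters are a fraction of the total.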

Code & Models

Repositories

cmu-flame/flame-moe
PyTorch · Official

Models

Datasets

CMU-FLAME/FLAME-MoE-Traces
dataset · 57 downloads

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods

Methods: Mixture of Experts · ADaptive gradient method with the OPTimal convergence rate