TL;DR
This paper explores how the sparsity in Mixture-of-Experts language models affects reasoning and memorization, revealing that optimal sparsity depends on active compute and data efficiency, which challenges traditional scaling laws.
Contribution
The paper shows that the optimal MoE sparsity depends on active FLOPs and total tokens per parameter (TPP), and it provides empirical evidence to guide model design.
Findings
At identical training loss, higher active FLOPs correlate with higher reasoning accuracy.
Memorization improves with more parameters, while reasoning benefits from an optimal tokens-per-parameter (TPP) ratio.
Reinforcement learning post-training does not change these trends.
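As a rough illustration of the TPP finding, the sketch below assumes that TPP means training tokens divided by total (not active) parameter count. The token budget is a hypothetical value chosen for illustration, not a figure from the paper; the total parameter counts are the ones encoded in the released checkpoint names with 1.5B active parameters.

```python
def tokens_per_parameter(train_tokens: float, total_params: float) -> float:
    """Training tokens divided by total parameter count (assumed TPP definition)."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return train_tokens / total_params


# Hypothetical token budget, for illustration only (not taken from the paper).
TRAIN_TOKENS = 100e9

# Total parameter counts read off the checkpoint names in the A1.5B family.
for total in (3.9e9, 7.1e9, 13.6e9, 26.4e9):
    tpp = tokens_per_parameter(TRAIN_TOKENS, total)
    print(f"{total / 1e9:>5.1f}B total params -> TPP = {tpp:.1f}")
```

Under these assumptions, adding experts at a fixed token budget and fixed active parameter count lowers TPP, which is the axis along which the summary says reasoning has an optimum.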
Abstract
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-k routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit…
Peer Reviews
Decision: ICLR 2026 Oral
- The paper is well-written and easy to understand.
- The experiments are comprehensive and support the major claims of the paper.
- One of the main findings is surprising: under iso-FLOP settings, higher sparsity improves performance only on memorization tasks, not on reasoning tasks.

- The paper should say more about its intuition and its originality relative to previous works such as [1] and [2], since similar observations regarding the optimal sparsity in MoE models have been made.
- Theoretical insights are encouraged to explain the experimental findings.
- The U-shaped trend for reasoning tasks in Figure 2 is very interesting, and I suggest the authors verify this finding on more reasoning tasks.
[1] Samira, Abnar, et al. "Parameters vs FLOPs: Scaling Laws for
1. It is an important observation that for MoE models, downstream accuracy can deviate from the predictions of conventional scaling laws, and these deviations may vary across different tasks.
2. Exhaustive experimentation is done on reasoning and coding tasks to demonstrate the U shape of task performance as total parameters increase in a FLOP-controlled setting.
3. Exhaustive experiments are done to show that post-training could not improve this.

1. The number of tokens used seems small if we are targeting end-task performance, especially for MoE models.
2. It would be good to get some ablations for various router choices, though that can be future work.
3. On page 9, Figure 8, it would be good to do the study at k>1 (ideally 8) and E>8.
4. More details about the post-training setup would be helpful. How many tokens are in the post-training set?
5. No details have been provided on whether continuous training is done or the learning rate is annealed before
- The paper studies the effect of sparsity in MoEs on downstream tasks. This area has not been examined in detail, so the study is timely and will likely be of interest to many researchers and practitioners.
- The empirical setup, including models, data, and downstream tasks, is described clearly in the paper. This gives me confidence that the experiments are reproducible.
- The models considered in the paper are not necessarily compute-optimal. This detail may provide additional insi

- The paper considers a single architecture inspired by the Mixtral family of MoEs. It is understandable why this choice was made (experiment volume), but I do wonder whether other architecture choices could change the conclusions made here. If possible, please discuss why Mixtral was chosen as opposed to other choices.
- The fact that memorization depends on total parameter count is known from prior literature. Furthermore, active number of parameters (inference FLOPs) have also been observed to im
Code & Models
- 🤗 llm-jp/optimal-sparsity-math-d512-E8-k2-320M-A170M · model · 4 downloads
- 🤗 llm-jp/optimal-sparsity-math-d512-E16-k2-520M-A170M · model · 3 downloads
- 🤗 llm-jp/optimal-sparsity-math-d2048-E8-k2-3.9B-A1.5B · model · 19 downloads
- 🤗 llm-jp/optimal-sparsity-math-d2048-E16-k2-7.1B-A1.5B · model · 4 downloads
- 🤗 llm-jp/optimal-sparsity-math-d2048-E32-k2-13.6B-A1.5B · model · 2 downloads
- 🤗 llm-jp/optimal-sparsity-math-d2048-E16-k16-7.1B-A7.1B · model · 1 download
- 🤗 llm-jp/optimal-sparsity-math-d2048-E8-k8-3.9B-A3.9B · model · 2 downloads
- 🤗 llm-jp/optimal-sparsity-math-d2048-E64-k2-26.4B-A1.5B · model · 3 downloads
- 🤗 llm-jp/optimal-sparsity-math-d2048-E32-k16-13.6B-A7.1B · model · 1 download
- 🤗 llm-jp/optimal-sparsity-math-d2048-E16-k8-7.1B-A3.9B · model · 3 downloads
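The checkpoint names above appear to encode each model's configuration. The sketch below parses them under the assumption that `d` is the hidden width, `E` the number of experts, `k` the top-k routing value, and the last two fields the total and active parameter counts. The sparsity formula `1 - k/E` is likewise an assumed definition, and `parse_checkpoint` is a hypothetical helper, not part of any released code.

```python
import re

# Assumed naming scheme: ...-d<width>-E<experts>-k<top_k>-<total>-A<active>
NAME_RE = re.compile(
    r"d(?P<d>\d+)-E(?P<E>\d+)-k(?P<k>\d+)-(?P<total>[\d.]+[MB])-A(?P<active>[\d.]+[MB])$"
)
UNITS = {"M": 1e6, "B": 1e9}


def to_count(text: str) -> float:
    """Convert a size string such as '3.9B' or '170M' to a parameter count."""
    return float(text[:-1]) * UNITS[text[-1]]


def parse_checkpoint(name: str) -> dict:
    """Extract the configuration fields encoded in a checkpoint name."""
    match = NAME_RE.search(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    experts, top_k = int(match["E"]), int(match["k"])
    total, active = to_count(match["total"]), to_count(match["active"])
    return {
        "d_model": int(match["d"]),
        "experts": experts,
        "top_k": top_k,
        "total_params": total,
        "active_params": active,
        "sparsity": 1 - top_k / experts,  # assumed definition
        "active_fraction": active / total,
    }


for repo in (
    "llm-jp/optimal-sparsity-math-d2048-E8-k2-3.9B-A1.5B",
    "llm-jp/optimal-sparsity-math-d2048-E64-k2-26.4B-A1.5B",
    "llm-jp/optimal-sparsity-math-d2048-E8-k8-3.9B-A3.9B",
):
    cfg = parse_checkpoint(repo)
    print(repo, f"sparsity={cfg['sparsity']:.3f}", f"active_fraction={cfg['active_fraction']:.3f}")
```

Read this way, the `k2` checkpoints hold active parameters at 1.5B while total parameters grow with `E`, and the `E8-k8` and `E16-k16` checkpoints are the dense-equivalent cases where every expert is active.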
