Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, and Vimal Thilak

arXiv:2501.12370 · cs.LG · July 4, 2025

Open Access

TL;DR

This paper investigates how the sparsity level of Mixture-of-Experts language models affects performance and training efficiency, and identifies optimal sparsity levels under constraints such as parameter size and total training compute.

Contribution

It provides new insights into the relationship between sparsity, parameters, and FLOPs in MoEs, identifying optimal sparsity levels for improved efficiency and performance.

Findings

01 Optimal sparsity improves training efficiency.

02 Sparsity levels affect downstream task performance.

03 Scaling laws are influenced by sparsity constraints.

Abstract

Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity is not yet fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, affects model performance during pretraining and downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an…
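The abstract's definition of sparsity as the fraction of inactive parameters can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: the `MoELayer` class, its field names, the two-matrix feed-forward expert, top-k routing, and the rule of thumb of roughly 2 FLOPs per active weight are all assumptions, and router and attention parameters are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayer:
    """Hypothetical MoE feed-forward layer (illustrative, not the paper's setup)."""
    d_model: int      # hidden size
    d_ff: int         # expert feed-forward width
    num_experts: int  # experts stored in the layer
    top_k: int        # experts activated per token (assumed top-k routing)

    def expert_params(self) -> int:
        # Assumption: each expert is two weight matrices, biases ignored.
        return 2 * self.d_model * self.d_ff

    def total_params(self) -> int:
        return self.num_experts * self.expert_params()

    def active_params(self) -> int:
        return self.top_k * self.expert_params()

    def sparsity(self) -> float:
        # Fraction of expert parameters that are inactive for a given token.
        return 1.0 - self.active_params() / self.total_params()

    def forward_flops_per_token(self) -> int:
        # Rough estimate: one multiply and one add per active weight.
        return 2 * self.active_params()


small = MoELayer(d_model=512, d_ff=2048, num_experts=8, top_k=2)
large = MoELayer(d_model=512, d_ff=2048, num_experts=32, top_k=2)

for name, layer in [("8 experts", small), ("32 experts", large)]:
    print(
        f"{name}: total={layer.total_params():,} "
        f"active={layer.active_params():,} "
        f"sparsity={layer.sparsity():.4f} "
        f"flops/token={layer.forward_flops_per_token():,}"
    )
```

Under these assumptions, going from 8 to 32 experts quadruples the stored parameters and raises sparsity while the per-token FLOPs estimate stays the same, which is the decoupling of parameters from compute that the abstract describes.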

Taxonomy

Topics: Topic Modeling · Bayesian Methods and Mixture Models · Machine Learning and Algorithms