Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Chaitanya Dwivedi; Binxuan Huang; Himanshu Gupta; Pratik Jayarao; Neeraj Varshney; Bing Yin

arXiv:2604.19835 · cs.LG · May 12, 2026

TL;DR

Expert upcycling enables scalable expansion of Mixture-of-Experts models during pre-training, improving efficiency by inheriting learned representations and selectively increasing capacity without retraining from scratch.

Contribution

The paper introduces expert upcycling, a method for expanding MoE models during pre-training that pairs expert duplication with a theoretical framework and utility-based expert selection to improve training efficiency.

Findings

1. Upcycled models match fixed-size baselines on validation loss.

2. Expert upcycling saves 32% of GPU hours in the paper's experiments.

3. Utility-based expert selection significantly improves gap closure.

Abstract

Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm…
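The duplication and router-extension step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `upcycle_experts` and `top_k_route`, the tensor layouts, and the plain-copy duplication are assumptions made for illustration. In particular, exact copies produce tied router logits, and how the paper breaks that symmetry (or selects which experts to duplicate) is not covered in the text above, so it is left out here.

```python
import numpy as np


def upcycle_experts(expert_weights, router_weights, m):
    """Hypothetical upcycling operator: E experts -> m*E experts.

    expert_weights: array of shape (E, d_model, d_ff), one matrix per expert.
    router_weights: array of shape (d_model, E), one logit column per expert.
    m: integer expansion factor (m >= 1).

    Assumption: every expert is copied verbatim m times and the router
    columns are copied alongside, so copy j of expert i sits at index j*E + i.
    """
    if m < 1:
        raise ValueError("expansion factor m must be >= 1")
    new_experts = np.tile(expert_weights, (m, 1, 1))  # (m*E, d_model, d_ff)
    new_router = np.tile(router_weights, (1, m))      # (d_model, m*E)
    return new_experts, new_router


def top_k_route(x, router_weights, k):
    """Return the indices of the k highest-scoring experts for each token.

    x: array of shape (tokens, d_model).
    The stable sort only makes tie-breaking deterministic in this sketch.
    """
    logits = x @ router_weights                       # (tokens, num_experts)
    order = np.argsort(-logits, axis=-1, kind="stable")
    return order[:, :k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    experts = rng.normal(size=(4, 8, 16))   # E = 4 experts
    router = rng.normal(size=(8, 4))
    big_experts, big_router = upcycle_experts(experts, router, m=2)

    tokens = rng.normal(size=(5, 8))
    # K stays at 2 before and after upcycling, so each token still
    # activates the same number of experts.
    print(top_k_route(tokens, router, k=2).shape)
    print(top_k_route(tokens, big_router, k=2).shape)
    print(big_experts.shape, big_router.shape)
```

Because `k` is unchanged, the number of expert matrices applied per token is the same for the 4-expert and 8-expert models in this sketch, while the total parameter count doubles.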

Code & Models

Repositories

amazon-science/expert-upcycling (GitHub)