TL;DR
Expert upcycling enables scalable expansion of Mixture-of-Experts models during pre-training, improving efficiency by inheriting learned representations and selectively increasing capacity without retraining from scratch.
Contribution
The paper introduces expert upcycling, a method for expanding MoE models during pre-training. It combines expert duplication with a theoretical framework and utility-based expert selection to improve training efficiency.
Findings
Upcycled models match fixed-size baselines on validation loss.
Expert upcycling saves 32% of GPU hours in experiments.
Utility-based expert selection significantly improves gap closure.
Abstract
Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm…
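The upcycling operator described in the abstract (expert duplication plus router extension, with top-K held fixed) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `upcycle` and `top_k_route`, the tensor layouts, and the choice to copy router rows verbatim are all assumptions made for illustration, and the paper's utility-based expert selection is not modeled here.

```python
import numpy as np

def upcycle(expert_weights, router_weights, m):
    """Build an (m*E)-expert layer from an E-expert layer by duplication.

    Assumed (hypothetical) layouts:
      expert_weights: (E, d_in, d_out), one weight matrix per expert
      router_weights: (E, d_in), one router row per expert
    """
    # Copy j of expert i lands at index j*E + i.
    new_experts = np.tile(expert_weights, (m, 1, 1))
    # Router extension: each new expert inherits its source expert's router row.
    new_router = np.tile(router_weights, (m, 1))
    return new_experts, new_router

def top_k_route(tokens, router_weights, k):
    """Return the indices of the k highest-scoring experts for each token."""
    logits = tokens @ router_weights.T            # (num_tokens, num_experts)
    return np.argsort(-logits, axis=-1)[:, :k]    # (num_tokens, k)

rng = np.random.default_rng(0)
E, K, m, d_in, d_out = 4, 2, 2, 8, 8
experts = rng.normal(size=(E, d_in, d_out))
router = rng.normal(size=(E, d_in))

new_experts, new_router = upcycle(experts, router, m)
tokens = rng.normal(size=(5, d_in))
routed = top_k_route(tokens, new_router, K)       # still K experts per token
```

Total expert parameters grow by a factor of m, while each token is still routed to K experts, so per-token compute is unchanged. Note that verbatim copies receive identical router logits, so ties in this sketch are resolved only by `argsort` ordering; how the paper differentiates duplicated experts is not stated in this excerpt.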
