Scaling Laws for Upcycling Mixture-of-Experts Language Models
Seng Pei Liew, Takuya Kato, Sho Takase

TL;DR
This paper investigates how to efficiently upcycle large language models into mixture-of-experts models, revealing empirical scaling laws and interaction effects that inform optimal training strategies under resource constraints.
Contribution
It introduces empirical scaling laws for upcycling LLMs to MoE models and analyzes how an interaction between the dense and upcycled training datasets affects the efficiency of upcycling.
Findings
Empirical scaling laws describe how performance depends on dataset size and model configuration.
An interaction term between the dense and upcycled training datasets limits the efficiency of upcycling at large computational budgets.
Guidelines indicate when upcycling can outperform from-scratch training within resource limits.
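To make the idea of an interaction term concrete, here is a minimal sketch in Python. The functional form and every constant in it are assumptions chosen for illustration and are not taken from this summary: a power law in both the dense token count and the upcycled token count, where the exponent on upcycled tokens shrinks as the dense token count grows. The names `upcycled_loss`, `effective_upcycle_exponent`, and all parameter values are hypothetical.

```python
import math


def effective_upcycle_exponent(d_dense, alpha_up=0.1, alpha_int=0.002):
    """Exponent governing how fast loss falls with upcycled tokens.

    Hypothetical form: the interaction coefficient alpha_int reduces the
    exponent as the dense pretraining token count d_dense increases.
    """
    return alpha_up - alpha_int * math.log(d_dense)


def upcycled_loss(d_dense, d_up, A=10.0, alpha_dense=0.1,
                  alpha_up=0.1, alpha_int=0.002, E=1.5):
    """Illustrative loss of an upcycled MoE model (all constants are made up).

    d_dense: tokens used to train the dense model before upcycling
    d_up:    tokens used to train the MoE model after upcycling
    E:       irreducible loss floor
    """
    exponent_up = effective_upcycle_exponent(d_dense, alpha_up, alpha_int)
    return A * d_dense ** (-alpha_dense) * d_up ** (-exponent_up) + E


if __name__ == "__main__":
    # Compare how much a 10x increase in upcycled tokens helps
    # for a lightly trained versus a heavily trained dense model.
    for d_dense in (1e9, 1e11):
        before = upcycled_loss(d_dense, 1e9)
        after = upcycled_loss(d_dense, 1e10)
        print(f"d_dense={d_dense:.0e}: relative gain above the floor = "
              f"{(before - after) / (before - 1.5):.4f}")
```

Under these assumed constants, more dense pretraining lowers the starting loss but also lowers the effective exponent, so each additional upcycled token buys a smaller relative improvement. This is one way an interaction term could limit the efficiency of upcycling at large budgets.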
Abstract
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models such as mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. In particular, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance to scale upcycling, and…
Taxonomy
Topics: Topic Modeling
