Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling

Ning Liao; Xiaoxing Wang; Zehao Lin; Weiyang Guo; Feng Hong; Shixiang Song; Geng Yu; Zihua Zhao; Sitao Xie; Longxuan Wei; Xiangqi Jin; Xiaohan Qin; Jiale Ma; Kai Chen; Jiangchao Yao; Zhouhan Lin; Junchi Yan; Zhiyu Li; Feiyu Xiong; Yanfeng Wang; Linfeng Zhang

arXiv:2507.18671·cs.LG·October 17, 2025

TL;DR

Innovator is a scientific LLM that uses a four-stage upcycling process to incorporate scientific knowledge across disciplines while preserving general capabilities, achieving a 25% average improvement on 30 scientific tasks while retaining 99% of general task performance.

Contribution

Innovator introduces a fine-grained MoE upcycling paradigm for continued pretraining: separate experts are expected to learn different scientific disciplines, while a shared expert preserves general performance.

Findings

01. Achieves a 25% average improvement on 30 scientific tasks

02. Maintains 99% of general task performance

03. Exhibits over 30% improvement in scientific reasoning

Abstract

A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continuing to pretrain an LLM on science data usually leads to catastrophic forgetting, that is, severe degradation in general ability. In this report, we present Innovator, which addresses this problem by upcycling a pre-trained dense LLM into a fine-grained Mixture-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is used for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid…
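Stage (2), splitting an FFN along its hidden dimension, can be illustrated with a minimal sketch. The abstract does not specify the exact decomposition, so everything here is an assumption for illustration: a two-matrix ReLU FFN, equal-sized contiguous slices, and the hypothetical helper names `split_ffn` and `sum_of_experts`. The sketch only demonstrates the general property that makes this kind of upcycling attractive: slicing the intermediate dimension yields smaller experts whose summed outputs reproduce the original dense FFN.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def ffn(W_in, W_out, x):
    # Dense FFN: W_out @ relu(W_in @ x).
    # W_in has shape (d_ff, d_model); W_out has shape (d_model, d_ff).
    return matvec(W_out, relu(matvec(W_in, x)))

def split_ffn(W_in, W_out, num_experts):
    # Hypothetical helper: slice the intermediate (d_ff) dimension into
    # equal contiguous chunks, one chunk per fine-grained expert.
    d_ff = len(W_in)
    assert d_ff % num_experts == 0, "d_ff must divide evenly"
    chunk = d_ff // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * chunk, (e + 1) * chunk
        W_in_e = W_in[lo:hi]                    # rows of the up-projection
        W_out_e = [row[lo:hi] for row in W_out]  # columns of the down-projection
        experts.append((W_in_e, W_out_e))
    return experts

def sum_of_experts(experts, x):
    # Activating every expert with weight 1 recovers the dense output,
    # because the activation is applied elementwise on the hidden units.
    d_model = len(experts[0][1])
    out = [0.0] * d_model
    for W_in_e, W_out_e in experts:
        y = ffn(W_in_e, W_out_e, x)
        out = [a + b for a, b in zip(out, y)]
    return out

random.seed(0)
d_model, d_ff = 4, 8
W_in = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_out = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
x = [random.uniform(-1, 1) for _ in range(d_model)]

experts = split_ffn(W_in, W_out, num_experts=4)
dense = ffn(W_in, W_out, x)
recombined = sum_of_experts(experts, x)
max_diff = max(abs(a - b) for a, b in zip(dense, recombined))
```

In a real MoE layer a router would activate only a subset of these experts per token, so the output would no longer equal the dense FFN exactly; how Innovator handles routing and the shared expert is described only at a high level in the abstract and is not modeled here.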
