Yuan3.0 Ultra: A Trillion-Parameter Enterprise-Oriented MoE LLM

YuanLab.ai: Shawn Wu; Jiangang Luo; Darcy Chen; Sean Wang; Louie Li; Allen Wang; Xudong Zhao; Tong Yu; Bach Li; Joseph Shen; Gawain Ma; Jasper Jia; Marcus Mao; Claire Wang; Hunter He; Carol Wang; Zera Zhang; Jason Wang; Chonly Shen; Leo Zhang; Logan Chen; Qasim Meng; James Gong; Daniel Zhao; Penn Zheng; Owen Zhu

arXiv:2601.14327·cs.LG·March 6, 2026

TL;DR

Yuan3.0 Ultra is a large, open-source MoE language model optimized for enterprise tasks, featuring a novel Layer-Adaptive Expert Pruning algorithm that improves training efficiency and reduces model size while maintaining high performance.

Contribution

The paper introduces Yuan3.0 Ultra, a trillion-parameter MoE LLM designed for enterprise applications, and proposes LAEP, a new expert pruning method that enhances pre-training efficiency and model compactness.

Findings

01. LAEP reduces model size by 33.3%
02. Pre-training efficiency improves by 49% with LAEP
03. Yuan3.0 Ultra achieves state-of-the-art results on enterprise benchmarks

Abstract

We introduce Yuan3.0 Ultra, an open-source Mixture-of-Experts (MoE) large language model with 68.8B activated parameters and 1010B total parameters, designed to improve performance on enterprise-scenario tasks while remaining competitive on general-purpose tasks. We propose the Layer-Adaptive Expert Pruning (LAEP) algorithm for the pre-training stage of MoE LLMs. In contrast to previous expert pruning approaches, which operate primarily in the post-training phase, LAEP improves training efficiency by selectively pruning underutilized experts and reorganizing experts across computing devices according to token distribution statistics. Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. When pre-training Yuan3.0 Ultra from scratch original with 1515B…
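The abstract describes LAEP only at a high level: prune underutilized experts layer by layer, then reorganize the remaining experts across devices using token distribution statistics. The sketch below illustrates that general idea and is not the paper's algorithm. The pruning rule (keep experts whose token count is at least a fraction of the layer's mean), the greedy placement heuristic, and all names (`prune_layer`, `place_experts`, `keep_fraction_of_mean`) are assumptions made for illustration.

```python
from typing import Dict, List


def prune_layer(token_counts: List[int], keep_fraction_of_mean: float = 0.5) -> List[int]:
    """Return the indices of experts to keep in one layer.

    Assumed rule (not from the paper): keep an expert when its routed-token
    count is at least keep_fraction_of_mean times the layer's mean count.
    The threshold is computed per layer, so layers with different load
    profiles can prune different numbers of experts.
    """
    mean_load = sum(token_counts) / len(token_counts)
    threshold = keep_fraction_of_mean * mean_load
    return [i for i, count in enumerate(token_counts) if count >= threshold]


def place_experts(
    token_counts: List[int], kept: List[int], num_devices: int
) -> Dict[int, List[int]]:
    """Assign kept experts to devices so that token load is roughly balanced.

    Assumed heuristic (not from the paper): take experts from heaviest to
    lightest and give each one to the device with the smallest load so far.
    """
    placement: Dict[int, List[int]] = {d: [] for d in range(num_devices)}
    load = [0] * num_devices
    for expert in sorted(kept, key=lambda i: token_counts[i], reverse=True):
        device = min(range(num_devices), key=lambda d: load[d])
        placement[device].append(expert)
        load[device] += token_counts[expert]
    return placement


# Hypothetical per-expert token counts for a single layer with 8 experts.
layer_counts = [900, 40, 700, 10, 650, 600, 30, 570]
kept_experts = prune_layer(layer_counts)
device_map = place_experts(layer_counts, kept_experts, num_devices=2)
print(kept_experts)
print(device_map)
```

A real system would also have to respect constraints this sketch ignores, such as an equal number of experts per device and the cost of moving expert weights during training.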

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Topic Modeling