Scaling Embeddings Outperforms Scaling Experts in Language Models

Hong Liu; Jiaqi Zhang; Chao Wang; Xing Hu; Linkun Lyu; Jiaqi Sun; Xurui Yang; Bo Wang; Fengcun Li; Yulei Qian; Lingtong Si; Yerui Sun; Rumei Li; Peng Pei; Yuchen Xie; Xunliang Cai

arXiv:2601.21204·cs.CL·February 12, 2026

TL;DR

This paper shows that, in specific regimes, scaling embeddings yields a better Pareto frontier than scaling experts in large language models. It pairs this with system optimizations and speculative decoding that turn the added sparsity into inference speedups, and introduces LongCat-Flash-Lite, a 68.5B-parameter model with ~3B activated parameters.

Contribution

The study introduces embedding scaling as an effective alternative to expert scaling, providing a comprehensive analysis, system optimizations, and a new large-scale model that surpasses MoE baselines.

Findings

01. Embedding scaling achieves a superior Pareto frontier in certain regimes.

02. System optimizations enable tangible inference speedups.

03. LongCat-Flash-Lite outperforms MoE baselines and is competitive with existing models.

Abstract

While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from…
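
The sparsity figures quoted in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: it computes the activated fraction from the two numbers the abstract gives (68.5B total, ~3B activated), and then shows why a plain embedding lookup table is sparse by construction, since each token reads one row however many rows the table holds. The function names, the table sizes, and the use of a simple lookup table are assumptions made for illustration; the paper's actual embedding design may differ.

```python
# Illustrative arithmetic only; the two figures below come from the abstract.
TOTAL_PARAMS = 68.5e9    # total parameters reported for LongCat-Flash-Lite
ACTIVATED_PARAMS = 3e9   # "~3B activated" per token (approximate)


def activated_fraction(activated: float, total: float) -> float:
    """Share of the model's parameters touched for a single token."""
    return activated / total


def embedding_lookup_params(vocab_size: int, dim: int) -> tuple[int, int]:
    """Return (total, activated-per-token) parameter counts of a lookup table.

    A lookup reads one row per token, so the activated count is `dim`
    regardless of how many rows the table has.
    """
    return vocab_size * dim, dim


frac = activated_fraction(ACTIVATED_PARAMS, TOTAL_PARAMS)
print(f"activated fraction: {frac:.1%}")

# Hypothetical table sizes, chosen only for illustration (not from the paper).
for vocab in (100_000, 1_000_000):
    total, active = embedding_lookup_params(vocab, dim=1024)
    print(f"rows={vocab:>9,}  total={total:>13,}  activated per token={active}")
```

Growing the table tenfold multiplies the stored parameters by ten while the per-token read stays at one row, which is the sense in which embedding parameters add capacity without adding per-token compute in this simplified picture.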

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Healthcare and Education