dots.llm1 Technical Report

Bi Huo; Bin Tu; Cheng Qin; Da Zheng; Debing Zhang; Dongjie Zhang; En Li; Fu Guo; Jian Yao; Jie Lou; Junfeng Tian; Li Hu; Ran Zhu; Shengdong Chen; Shuo Liu; Su Guang; Te Wo; Weijun Zhang; Xiaoming Shi; Xinxin Peng; Xing Wu; Yawen Liu; Yuqiu Ji; Ze Wen; Zhenhai Liu; Zichao Li; Zilong Liao

arXiv:2506.05767·cs.CL·June 9, 2025

TL;DR

dots.llm1 is a large-scale Mixture of Experts (MoE) language model that activates 14B of its 142B parameters per token, reaches performance comparable to Qwen2.5-72B at reduced training and inference cost, and comes with open-source intermediate checkpoints for research.

Contribution

We introduce dots.llm1, a 142B-parameter MoE model that activates 14B parameters per token, is pretrained on 11.2T high-quality tokens without synthetic data, and is released with open-source intermediate checkpoints at every one trillion tokens.

Findings

1. Achieves performance comparable to Qwen2.5-72B while activating only 14B of its 142B parameters per token.

2. Reduces training and inference costs through its MoE architecture.

3. Open-sources intermediate training checkpoints at every one trillion tokens.

Abstract

Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
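The idea of activating only a subset of parameters per token can be illustrated with a minimal top-k routing sketch. This is not the dots.llm1 implementation: the abstract does not state the number of experts, the value of k, or the router design, so the four scalar "experts", k=2, the logits, and the function names below are all hypothetical. The only figures taken from the abstract are the 14B active and 142B total parameter counts.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the router probabilities renormalized over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum over the selected experts only; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy setup (an assumption, not from the report):
# 4 scalar "experts", 2 of them active for this token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
selected = route_top_k(logits, k=2)
output = moe_forward(1.0, experts, logits, k=2)

# Figures from the abstract: 14B active parameters out of 142B total.
active_fraction = 14 / 142  # roughly 0.099, i.e. about 10% of parameters per token
```

In this toy case only two of the four experts are evaluated for the token. In a real MoE model the always-active components (such as attention layers) also count toward the active parameters, so the active fraction is not simply k divided by the number of experts.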

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare