YuLan-Mini: An Open Data-efficient Language Model
Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen

TL;DR
YuLan-Mini is a 2.42B-parameter language model trained on 1.08 trillion tokens. It achieves top-tier performance among models of similar scale through an efficient data pipeline, robust optimization, and targeted data selection.
Contribution
The paper introduces training strategies and data handling methods that let a smaller language model reach high performance with less data and fewer resources.
Findings
YuLan-Mini achieves performance comparable to industry-leading models that require significantly more training data.
Efficient training is possible with an optimized data pipeline and training techniques.
The data composition details are openly released for reproducibility.
Abstract
Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release…
Code & Models
- 🤗 yulan-team/YuLan-Mini · model · 15 dl · ♡ 38
- 🤗 yulan-team/YuLan-Mini-Before-Annealing · model · 5 dl · ♡ 7
- 🤗 yulan-team/YuLan-Mini-Phase20 · model · 4 dl · ♡ 2
- 🤗 QuantFactory/YuLan-Mini-GGUF · model · 202 dl · ♡ 2
- 🤗 yulan-team/YuLan-Mini-Instruct · model · 144 dl · ♡ 7
- 🤗 yulan-team/reasoning-classifier · model · 4 dl · ♡ 2
- 🤗 yulan-team/math-classifier · model · 7 dl · ♡ 1
- 🤗 yulan-team/code-classifier · model · 21 dl · ♡ 1
- 🤗 yulan-team/YuLan-Mini-Phase15 · model
- 🤗 diskrot/YuLan-Mini-diskrot · model · 79 dl
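The repositories listed above are hosted on the Hugging Face Hub. As a minimal sketch, and assuming the base checkpoint loads through the standard `transformers` Auto classes (this page does not confirm that), text generation might look like the following. The helper name `generate_text` and its defaults are hypothetical; only the repository name comes from the list.

```python
# Repository name taken from the list above; everything else is illustrative.
MODEL_ID = "yulan-team/YuLan-Mini"


def generate_text(prompt: str, model_id: str = MODEL_ID, max_new_tokens: int = 64) -> str:
    """Load a causal LM from the Hugging Face Hub and continue `prompt`.

    Assumption: the checkpoint is compatible with AutoModelForCausalLM.
    The first call downloads the tokenizer and weights.
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize to PyTorch tensors and generate a bounded continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("Renmin University of China is")` would download the weights on first use, so it needs network access and several gigabytes of disk space. Passing another repository name from the list as `model_id` should work the same way for the other checkpoints in `transformers` format; the GGUF build targets a different runtime and is not covered by this sketch.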
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection
