YuLan-Mini: An Open Data-efficient Language Model

Yiwen Hu; Huatong Song; Jia Deng; Jiapeng Wang; Jie Chen; Kun Zhou; Yutao Zhu; Jinhao Jiang; Zican Dong; Wayne Xin Zhao; Ji-Rong Wen

arXiv:2412.17743·cs.CL·December 25, 2024

TL;DR

YuLan-Mini is a 2.42B-parameter base language model trained on 1.08 trillion tokens. It achieves top-tier performance among models of similar parameter scale through an efficient data pipeline, robust optimization, and targeted data selection.

Contribution

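The paper contributes three pre-training techniques: a data pipeline that combines data cleaning with data scheduling, an optimization method that mitigates training instability, and an annealing approach that incorporates targeted data selection and long-context training. Together they let a 2.42B-parameter model reach strong performance with less training data.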

Findings

01

YuLan-Mini achieves performance comparable to industry-leading models that require significantly more training data.

02

Efficient training is possible with an optimized data pipeline, stable optimization, and a targeted annealing stage.

03

The authors openly release data composition details to support reproduction.

Abstract

Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long-context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release…

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Balanced Selection