YuLan: An Open-source Large Language Model

Yutao Zhu; Kun Zhou; Kelong Mao; Wentong Chen; Yiding Sun; Zhipeng Chen; Qian Cao; Yihan Wu; Yushuo Chen; Feng Wang; Lei Zhang; Junyi Li; Xiaolei Wang; Lei Wang; Beichen Zhang; Zican Dong; Xiaoxue Cheng; Yuhan Chen; Xinyu Tang; Yupeng Hou; Qiangqiang Ren; Xincheng Pang; Shufang Xie; Wayne Xin Zhao; Zhicheng Dou; Jiaxin Mao; Yankai Lin; Ruihua Song; Jun Xu; Xu Chen; Rui Yan; Zhewei Wei; Di Hu; Wenbing Huang; Ze-Feng Gao; Yueguo Chen; Weizheng Lu; Ji-Rong Wen

arXiv:2406.19853·cs.CL·July 1, 2024

TL;DR

YuLan is an open-source 12-billion-parameter language model pre-trained on diverse English, Chinese, and multilingual data with a three-stage pre-training method and curriculum learning. It performs comparably to state-of-the-art LLMs on English and Chinese benchmarks.

Contribution

The paper introduces YuLan, an open-source multilingual LLM, and documents its training methodology in detail, including curriculum learning and instruction tuning, to support transparency and reproducibility.

Findings

01. YuLan achieves performance comparable to state-of-the-art LLMs.

02. The three-stage training enhances multilingual and complex knowledge learning.

03. Open-source code and models promote further research.

Abstract

Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with 12 billion parameters. The base model of YuLan is pre-trained on approximately 1.7T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a…
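To put the two scale figures in the abstract side by side, the sketch below computes the tokens-per-parameter ratio and a rough training-compute estimate. The parameter and token counts come from the abstract. The 6 FLOPs per parameter per token factor is a common rule of thumb for dense transformer training, assumed here for illustration; it is not a number reported by the authors, and the function names are hypothetical.

```python
# Figures from the abstract; the 6*N*D rule of thumb is an assumption,
# not a value reported by the YuLan authors.
PARAMS = 12e9    # 12 billion parameters
TOKENS = 1.7e12  # approximately 1.7T pre-training tokens


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Number of training tokens seen per model parameter."""
    return tokens / params


def approx_training_flops(tokens: float, params: float) -> float:
    """Rough estimate: about 6 FLOPs per parameter per token
    (forward plus backward pass) for a dense transformer."""
    return 6.0 * params * tokens


ratio = tokens_per_parameter(TOKENS, PARAMS)
flops = approx_training_flops(TOKENS, PARAMS)

print(f"tokens per parameter: {ratio:.1f}")
print(f"approx. training compute: {flops:.3e} FLOPs")
```

Under these assumptions the model sees roughly 140 tokens per parameter. The compute figure is an order-of-magnitude guide only, since it ignores attention cost and any data repeated across the three pre-training stages.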

Code & Models

Repositories

ruc-gsai/yulan-chat
PyTorch · Official

Datasets

Tiiny/PowerCoding
dataset · 431 downloads

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Balanced Selection