YuLan: An Open-source Large Language Model
Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang

TL;DR
YuLan is an open-source 12-billion-parameter language model pre-trained on a diverse corpus of English, Chinese, and multilingual text with a three-stage curriculum learning approach, achieving performance comparable to state-of-the-art LLMs on English and Chinese benchmarks.
Contribution
This paper introduces YuLan, a large open-source multilingual LLM with a detailed training methodology including curriculum learning and instruction tuning, advancing transparency and reproducibility.
Findings
YuLan achieves performance comparable to state-of-the-art LLMs.
The three-stage training enhances multilingual and complex knowledge learning.
Open-source code and models promote further research.
Abstract
Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with 12 billion parameters. The base model of YuLan is pre-trained on approximately T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a…
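As an illustration of what a staged pre-training schedule over the three data sources named in the abstract might look like, here is a minimal sketch. It is not the authors' implementation: the stage boundaries (equal thirds of training), the mixture weights, and the names `STAGES`, `stage_for_step`, and `sample_source` are all hypothetical and chosen only to make the idea concrete.

```python
import random

# Hypothetical per-stage sampling weights over the three data sources
# mentioned in the abstract. The numbers are illustrative assumptions,
# not values reported for YuLan.
STAGES = [
    {"english": 0.8, "chinese": 0.1, "multilingual": 0.1},
    {"english": 0.5, "chinese": 0.4, "multilingual": 0.1},
    {"english": 0.4, "chinese": 0.4, "multilingual": 0.2},
]


def stage_for_step(step, total_steps, num_stages=len(STAGES)):
    """Map a training step to a stage index.

    Assumes training is split into equal-length stages, which is a
    simplification made for this sketch.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    return min(step * num_stages // total_steps, num_stages - 1)


def sample_source(step, total_steps, rng):
    """Pick the data source for the next batch using the current stage's weights."""
    weights = STAGES[stage_for_step(step, total_steps)]
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

A training loop would call `sample_source(step, total_steps, random.Random(0))` once per batch and draw that batch from the chosen corpus, so the data mixture shifts as training moves from one stage to the next.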
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection
