Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Xinrun Du; Zhouliang Yu; Songyang Gao; Ding Pan; Yuyang Cheng; Ziyang Ma; Ruibin Yuan; Xingwei Qu; Jiaheng Liu; Tianyu Zheng; Xinchen Luo; Guorui Zhou; Wenhu Chen; and Ge Zhang

arXiv:2404.04167·cs.CL·September 16, 2024·5 cites

Open Access 10 Models 3 Datasets

TL;DR

This paper introduces CT-LLM, a 2-billion-parameter Chinese-centric large language model pretrained from scratch primarily on Chinese data. The model performs strongly on Chinese language tasks, and the authors open-source the training process and benchmarks.

Contribution

The study presents a novel Chinese-centric LLM trained from scratch with an extensive Chinese corpus, diverging from traditional English-focused training approaches.

Findings

01 Achieves high performance on Chinese language tasks

02 Demonstrates proficiency in English through supervised fine-tuning

03 Open-sources training data, process, and benchmarks

Abstract

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other…
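The corpus composition stated in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: the dictionary keys and the `mixture_shares` helper are hypothetical names, not part of the paper's released code, and the token counts are simply the figures quoted above (in billions).

```python
# Token counts in billions, taken from the abstract.
# The key names are illustrative assumptions, not identifiers from the paper.
corpus_tokens_b = {"chinese": 800, "english": 300, "code": 100}


def mixture_shares(counts):
    """Return each source's share of the total token count as a fraction."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = mixture_shares(corpus_tokens_b)

# The three parts add up to the 1,200 billion tokens quoted in the abstract.
print("total (B tokens):", sum(corpus_tokens_b.values()))
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# prints:
# total (B tokens): 1200
# chinese: 66.7%
# english: 25.0%
# code: 8.3%
```

Under these figures, roughly two thirds of the pretraining tokens are Chinese, which is what "Chinese-centric" amounts to quantitatively here.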

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Shrink and Fine-Tune