Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Wenhu Chen, and Ge Zhang

TL;DR
This paper introduces CT-LLM, a 2-billion-parameter Chinese-centric large language model trained primarily on Chinese data. The model performs strongly on Chinese language understanding tasks, and the authors open-source the training process and benchmarks.
Contribution
The study presents a novel Chinese-centric LLM trained from scratch with an extensive Chinese corpus, diverging from traditional English-focused training approaches.
Findings
Achieves high performance on Chinese language tasks
Demonstrates proficiency in English through supervised fine-tuning
Open-sources training data, process, and benchmarks
Abstract
In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other…
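The corpus composition stated in the abstract can be turned into mixture shares with a few lines of arithmetic. The following is a minimal sketch that uses only the three token counts given above; the names `TOKENS_BILLIONS` and `mixture_shares` are illustrative and do not come from the paper.

```python
# Token counts in billions, as stated in the abstract.
TOKENS_BILLIONS = {"chinese": 800, "english": 300, "code": 100}


def mixture_shares(counts):
    """Return each source's fraction of the total token count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(TOKENS_BILLIONS.values())  # should match the 1,200 billion total
shares = mixture_shares(TOKENS_BILLIONS)

for name, share in shares.items():
    print(f"{name:8s} {share:6.1%}")
```

Under these figures, Chinese text makes up two thirds of the corpus, English one quarter, and code the remaining twelfth.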
Code & Models
- 🤗 m-a-p/CT-LLM-intermediate-ckpts · model · ♡ 1
- 🤗 m-a-p/CT-LLM-SFT-DPO · model · 13 dl · ♡ 5
- 🤗 m-a-p/CT-LLM-SFT · model · 21 dl · ♡ 1
- 🤗 m-a-p/CT-LLM-Base · model · 17 dl · ♡ 11
- 🤗 m-a-p/CT-LLM-SFT-experiment-ckpts · model
- 🤗 sunatte/txt2sql · model
- 🤗 MachoMaheen/devdock4bit · model
- 🤗 RichardErkhov/m-a-p_-_CT-LLM-SFT-awq · model · 1 dl
- 🤗 RichardErkhov/m-a-p_-_CT-LLM-SFT-DPO-awq · model · 1 dl
- 🤗 RichardErkhov/m-a-p_-_CT-LLM-Base-awq · model · 1 dl
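The released checkpoints listed above could be loaded along the following lines. This is a hedged sketch, not the authors' documented procedure: it assumes the repositories load through the standard Hugging Face `transformers` Auto classes, which this page does not confirm. The short variant keys and the helper names `resolve_repo` and `load_ct_llm` are hypothetical.

```python
# Repository ids copied from the list above; the short keys are hypothetical labels.
MODEL_IDS = {
    "base": "m-a-p/CT-LLM-Base",
    "sft": "m-a-p/CT-LLM-SFT",
    "sft-dpo": "m-a-p/CT-LLM-SFT-DPO",
}


def resolve_repo(variant: str) -> str:
    """Map a short variant label to its Hugging Face repository id."""
    try:
        return MODEL_IDS[variant]
    except KeyError:
        known = ", ".join(sorted(MODEL_IDS))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}") from None


def load_ct_llm(variant: str = "base"):
    """Download and return (tokenizer, model) for the chosen variant.

    Assumption: the checkpoint is compatible with the generic Auto classes.
    """
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = resolve_repo(variant)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)
    return tokenizer, model
```

Calling `load_ct_llm("sft")` would download the weights on first use, so it needs network access and enough disk space for a 2B-parameter model. If a repository ships custom model code, additional arguments may be required; check the model card before relying on this sketch.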
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Shrink and Fine-Tune
