From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan

Lei Yang; Leiyu Pan; Bojian Xiong; Renren Jin; Shaowei Zhang; Yue Chen; Ling Shi; Jiang Zhou; Junru Wu; Zhen Wang; Jianxiang Peng; Juesi Xiao; Tianyu Dong; Zhuowen Han; Zhuo Chen; Yuqi Ren; Deyi Xiong

arXiv:2507.09205 · cs.CL · May 14, 2026

TL;DR

This paper builds Tibetan large language models through large-scale data curation, continual pre-training of Qwen2.5-7B, and a Mixture-of-Experts extension, and reports that they outperform existing Tibetan-focused models of similar scale.

Contribution

The paper introduces a 72 GB Tibetan corpus, described as the largest to date, and adapts Qwen2.5-7B through continual pre-training and multilingual instruction tuning. It then extends the dense model to a 50B-A10B Mixture-of-Experts model.

Findings

01 The resulting models outperform existing Tibetan-focused models of similar scale.

02 The authors construct multiple high-quality Tibetan evaluation datasets.

03 The work demonstrates effective scaling and adaptation for low-resource language modeling.

Abstract

Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via…
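To make "balanced multilingual continual pre-training" concrete, here is a minimal sketch of one way to interleave Tibetan, Chinese, and English documents so that each language is drawn with a chosen probability. It is an illustration only: the abstract does not state the mixture ratios or the sampling scheme, so the equal default weights, the language codes, and the helper names `balanced_language_stream` and `count_languages` are all assumptions rather than the authors' method.

```python
import random
from collections import Counter
from itertools import islice


def balanced_language_stream(corpora, weights=None, seed=0):
    """Yield (language, document) pairs, choosing the language by weight.

    corpora: dict mapping a language code to a non-empty list of documents.
    weights: optional dict of sampling weights; equal weights by default
             (an assumption, the source does not give the actual ratios).
    Smaller corpora are cycled, so low-resource text is repeated more often.
    """
    if any(len(docs) == 0 for docs in corpora.values()):
        raise ValueError("every language needs at least one document")
    rng = random.Random(seed)
    languages = sorted(corpora)
    if weights is None:
        weights = {lang: 1.0 for lang in languages}
    probs = [weights[lang] for lang in languages]
    cursors = {lang: 0 for lang in languages}
    while True:
        lang = rng.choices(languages, weights=probs, k=1)[0]
        docs = corpora[lang]
        yield lang, docs[cursors[lang] % len(docs)]
        cursors[lang] += 1


def count_languages(corpora, n, weights=None, seed=0):
    """Count how often each language appears in the first n samples."""
    stream = balanced_language_stream(corpora, weights=weights, seed=seed)
    return Counter(lang for lang, _ in islice(stream, n))


if __name__ == "__main__":
    # Toy corpora of unequal size; "bo", "zh", "en" are placeholder codes.
    toy = {"bo": ["bo-1", "bo-2"], "zh": ["zh-1"], "en": ["en-1", "en-2", "en-3"]}
    print(count_languages(toy, 3000))
```

Sampling by language rather than by document keeps a small corpus from being swamped by larger ones, at the cost of repeating its documents more often. Passing a `weights` dict changes the mix without touching the data.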
