FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Du, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong

TL;DR
FuxiTranyu is a balanced, multilingual large language model with 8 billion parameters, trained on diverse data covering 43 natural and 16 programming languages, achieving competitive performance and consistent representations across languages.
Contribution
The paper introduces FuxiTranyu, a new open-source multilingual LLM trained on balanced data, with instruction tuning and alignment, advancing multilingual capabilities and interpretability.
Findings
FuxiTranyu performs competitively with existing multilingual LLMs on a range of multilingual benchmarks.
The model demonstrates consistent multilingual representations across languages.
The open-source release facilitates further research in multilingual NLP.
Abstract
Large language models (LLMs) have demonstrated prowess in a wide range of tasks. However, many LLMs exhibit significant performance discrepancies between high- and low-resource languages. To mitigate this challenge, we present FuxiTranyu, an open-source multilingual LLM, which is designed to satisfy the need of the research community for balanced and high-performing multilingual capabilities. The base model, FuxiTranyu-8B, features 8 billion parameters and is trained from scratch on meticulously balanced multilingual data that contains 600 billion tokens covering 43 natural languages and 16 programming languages. We also develop two instruction-tuned models: FuxiTranyu-8B-SFT, which is fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO, which is further refined with DPO on a preference dataset for enhanced alignment ability. Extensive experiments on a wide…
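The abstract mentions that FuxiTranyu-8B-DPO is refined with DPO (Direct Preference Optimization) on a preference dataset. As a rough illustration only, the standard per-example DPO loss can be sketched as below. This is not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical choices for this sketch, and the paper's actual hyperparameters are not stated in this summary.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss (illustrative sketch, not the paper's code).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and reference assign identical log-probabilities, the loss equals log 2; it falls below that as the policy shifts probability toward the preferred response relative to the reference.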
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Direct Preference Optimization · Balanced Selection
