Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin,, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu,, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul,, Fajri Koto, Min Si Thu, Hynek Kydl\'i\v{c}ek, Zeyi Liu

TL;DR
Sailor2 is a family of multilingual language models tailored for South-East Asian languages, achieving competitive performance and providing a comprehensive guide for efficient development of inclusive LLMs.
Contribution
It introduces Sailor2 models with extensive SEA language support and offers a detailed cookbook for developing inclusive multilingual LLMs.
Findings
Sailor2-20B outperforms GPT-4o on SEA languages.
Models support 13 SEA languages while maintaining Chinese and English proficiency.
Provides a practical methodology for building inclusive multilingual LLMs.
Abstract
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗sail/Sailor2-1B-Premodel· 8 dl8 dl
- 🤗sail/Sailor2-20B-Premodel· 5 dl5 dl
- 🤗sail/Sailor2-8B-Premodel· 6 dl6 dl
- 🤗sail/Sailor2-8B-Chatmodel· 9.6k dl· ♡ 199.6k dl♡ 19
- 🤗sail/Sailor2-1Bmodel· 133 dl· ♡ 7133 dl♡ 7
- 🤗sail/Sailor2-8Bmodel· 184 dl· ♡ 8184 dl♡ 8
- 🤗sail/Sailor2-20B-Chat-1203model· 11 dl· ♡ 2411 dl♡ 24
- 🤗sail/Sailor2-20Bmodel· 22 dl· ♡ 1022 dl♡ 10
- 🤗sail/Sailor2-1B-Chatmodel· 1.0k dl· ♡ 161.0k dl♡ 16
- 🤗cortexso/sailor-2model· 186 dl186 dl
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
