Xmodel-1.5: An 1B-scale Multilingual LLM
Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling

TL;DR
Xmodel-1.5 is a 1-billion-parameter multilingual LLM with a custom unigram tokenizer. It delivers competitive results across several languages, achieves state-of-the-art results in Thai, and supports low-resource language research.
Contribution
The paper introduces Xmodel-1.5, a multilingual LLM with a custom 65,280-token unigram tokenizer, and releases Xdata_Thai, a Thai-specific evaluation dataset intended to support low-resource language research.
Findings
Outperforms Alibaba's PolyLM-1.7B on multiple languages
Achieves state-of-the-art results in Thai language benchmarks
Demonstrates strong performance on mMMLU and PIQA benchmarks
Abstract
We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language model pretrained on 2 trillion tokens, designed for balanced performance and scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English, outperforming Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai. To support low-resource language research, we release Xdata_Thai, a Thai-specific evaluation dataset featuring unique linguistic challenges such as gendered particles and idioms. While the model demonstrates strong performance, there is still room for improvement in handling…
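The abstract contrasts a unigram tokenizer with BPE. As a rough illustration of the difference, a unigram tokenizer assigns each vocabulary piece a probability and picks the segmentation of a string that maximizes the product of piece probabilities, typically with a Viterbi search. The sketch below shows only that decoding step. The vocabulary, the probabilities, and the names `viterbi_segment` and `toy_vocab` are hypothetical and are not taken from Xmodel-1.5, whose tokenizer has 65,280 entries and whose training procedure is not described in this summary.

```python
import math


def viterbi_segment(text, log_probs):
    """Return (pieces, score) for the most probable segmentation of `text`.

    `log_probs` maps each vocabulary piece to its log-probability under a
    unigram model, so a segmentation's score is the sum of its pieces' scores.
    """
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best[start] + log_probs[piece]
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best[n]


# Hypothetical toy vocabulary; the values are illustrative piece probabilities.
toy_vocab = {
    "unigram": 0.2, "un": 0.1, "gram": 0.1, "i": 0.05,
    "u": 0.05, "n": 0.05, "g": 0.05, "r": 0.05, "a": 0.05, "m": 0.05,
}
toy_log_probs = {piece: math.log(p) for piece, p in toy_vocab.items()}

pieces, score = viterbi_segment("unigram", toy_log_probs)
print(pieces, score)
```

With this toy vocabulary the whole word wins over any split, because a single piece with probability 0.2 outscores every product of smaller pieces. BPE, by contrast, builds its segmentation by applying a fixed sequence of learned merges rather than comparing whole-segmentation scores. Production unigram tokenizers also learn the piece probabilities, usually by iteratively pruning a large seed vocabulary, which this sketch does not attempt.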
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Byte Pair Encoding
