Xmodel-1.5: An 1B-scale Multilingual LLM

Wang Qun; Liu Yang; Lin Qingquan; Jiang Ling

arXiv:2411.10083 · cs.CL · December 5, 2024

TL;DR

Xmodel-1.5 is a 1-billion-parameter multilingual LLM with a custom unigram tokenizer. It delivers competitive results across several languages, reaches state-of-the-art results in Thai, and comes with a Thai evaluation dataset to support low-resource language research.

Contribution

The paper introduces Xmodel-1.5, a 1-billion-parameter multilingual LLM that uses a custom unigram tokenizer with 65,280 tokens, and releases Xdata_Thai, a Thai-specific evaluation dataset.

Findings

01 Outperforms Alibaba's PolyLM-1.7B on multiple languages

02 Achieves state-of-the-art results in Thai language benchmarks

03 Demonstrates strong performance on mMMLU and PIQA benchmarks

Abstract

We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language model pretrained on 2 trillion tokens, designed for balanced performance and scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English, outperforming Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai. To support low-resource language research, we release Xdata_Thai, a Thai-specific evaluation dataset featuring unique linguistic challenges such as gendered particles and idioms. While the model demonstrates strong performance, there is still room for improvement in handling…
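
To make the tokenizer choice concrete, here is a minimal sketch of how a unigram tokenizer segments text: each vocabulary piece carries a log-probability, and the tokenizer picks the segmentation with the highest total score using Viterbi search. This is not the Xmodel-1.5 tokenizer. The vocabulary TOY_VOCAB, its probabilities, the function name viterbi_segment, and the unk_logp fallback are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy vocabulary (piece -> log-probability), invented for
# illustration. The real Xmodel-1.5 tokenizer has 65,280 learned pieces.
TOY_VOCAB = {
    piece: math.log(prob)
    for piece, prob in {
        "unigram": 0.001, "uni": 0.08, "un": 0.1, "igram": 0.05, "gram": 0.1,
        "u": 0.01, "n": 0.01, "i": 0.01, "g": 0.01,
        "r": 0.01, "a": 0.01, "m": 0.01,
    }.items()
}


def viterbi_segment(text, vocab, unk_logp=-20.0):
    """Return (pieces, score) for the highest-scoring segmentation of text.

    Single characters missing from vocab fall back to unk_logp (an assumed
    penalty) so that every input can still be segmented.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(piece) for piece in vocab)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab:
                logp = vocab[piece]
            elif end - start == 1:
                logp = unk_logp   # unknown single character
            else:
                continue          # unknown multi-character span: skip
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    pieces.reverse()
    return pieces, best[n]


pieces, score = viterbi_segment("unigram", TOY_VOCAB)
print(pieces)  # ['uni', 'gram']
```

In this toy vocabulary, "uni" + "gram" wins because the product of its piece probabilities (0.08 × 0.1) is higher than that of the single piece "unigram" (0.001) or of "un" + "igram" (0.1 × 0.05). A BPE tokenizer would instead apply a fixed sequence of learned merges, with no probabilistic score over whole segmentations.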

Code & Models

Repositories

XiaoduoAILab/XmodelLM-1.5
PyTorch · Official

Models

XiaoduoAILab/XmodelLM1.5
Hugging Face model · ♡ 2

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies

Methods: Byte Pair Encoding