MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish
Xin Huang, Tarun Kumar Vangani, Minh Duc Pham, Xunlong Zou, Bin Wang, Zhengyuan Liu, Ai Ti Aw

TL;DR
MERaLiON-TextLLM introduces open-source multilingual models tailored for Chinese, Indonesian, Malay, and Singlish, enhancing cross-lingual understanding and outperforming baseline Llama-3 models through continued pre-training and weight merging.
Contribution
The paper presents a new series of open-source multilingual language models specifically optimized for underrepresented languages, with improved performance over existing models.
Findings
Performance improvements across multiple benchmarks in target languages.
Model checkpoints provided as resources for further research.
Enhanced understanding and generation capabilities in Chinese, Indonesian, Malay, and Singlish.
Abstract
Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
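The abstract names weight merging as one of the two refinement steps but does not say which merging scheme is used. As a minimal sketch, assuming plain linear interpolation between a base checkpoint and a continued-pre-training checkpoint (one common form of merging, not necessarily the authors' method), the idea looks like this. The function name `merge_weights`, the mixing coefficient `alpha`, and the toy parameter names are all hypothetical.

```python
from typing import Dict, List

# A checkpoint is modeled as a flat mapping from parameter name to values.
# Real checkpoints hold tensors; plain lists keep this sketch dependency-free.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, tuned: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints: (1 - alpha) * base + alpha * tuned.

    Assumption: both checkpoints share the same parameter names and shapes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Toy checkpoints (hypothetical values, for illustration only).
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
tuned = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}

merged = merge_weights(base, tuned, alpha=0.25)
print(merged)
```

With `alpha=0.25` the merged model stays closer to the base checkpoint; raising `alpha` moves it toward the continued-pre-training checkpoint.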
Code & Models
- 🤗 MERaLiON/MERaLiON-3-10B-preview (model) · 322 downloads · ♡ 1
- 🤗 MERaLiON/LLaMA-3-MERaLiON-8B-Instruct (model) · 94 downloads · ♡ 3
- 🤗 MERaLiON/MERaLiON-2-10B (model) · 711 downloads · ♡ 11
- 🤗 MERaLiON/MERaLiON-2-3B (model) · 2.6k downloads · ♡ 5
- 🤗 lewiswoncy/m_test_9 (model) · 42 downloads
- 🤗 lewiswoncy/m_test_9_11 (model) · 2 downloads
- 🤗 MERaLiON/MERaLiON-2-3B-MLX (model) · 8 downloads
- 🤗 MERaLiON/MERaLiON-2-10B-MLX (model) · 12 downloads
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
