MaLLaM -- Malaysia Large Language Model
Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan

TL;DR
MaLLaM is a set of large language models trained from scratch on Malaysian data. The models perform strongly on Malay language understanding and generation tasks and contribute to localized NLP.
Contribution
This work introduces MaLLaM, the first large-scale Malay language models trained from scratch with up to 5 billion parameters, tailored for Malaysian language understanding and generation.
Findings
Instruction-tuned MaLLaM models perform competitively against ChatGPT3.5 and Malaysian Mistral.
Instruction-tuned MaLLaM models show notable proficiency in language tasks.
Models effectively capture Malaysian linguistic nuances.
Abstract
Addressing the gap in Large Language Models pretrained from scratch with Malaysian context, we trained models with 1.1 billion, 3 billion, and 5 billion parameters for a single epoch on a substantial 349GB dataset, equivalent to 90 billion tokens under our pretrained Byte Pair Encoding (BPE) tokenizer. MaLLaM contributes to enhanced natural language understanding and generation tasks in the Malay language. Although trained on a smaller dataset of 90 billion tokens, our instruction-tuned MaLLaM models perform competitively. When compared to ChatGPT3.5 and Malaysian Mistral, MaLLaM's instruction-tuned models demonstrate notable proficiency, underscoring the effectiveness of our approach in capturing and understanding the nuances of the Malaysian language. MaLLaM models mark a significant contribution to the field, providing comprehensive language representations grounded in Malaysian…
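The figures quoted in the abstract imply a few simple ratios that help put the training budget in context. The sketch below is only back-of-the-envelope arithmetic on those numbers, not anything taken from the paper's code. It assumes "349GB" means decimal gigabytes (10^9 bytes), and the function and constant names are hypothetical.

```python
# Back-of-the-envelope ratios from the numbers quoted in the abstract.
# Assumption: "349GB" is decimal gigabytes (10**9 bytes); the paper may
# have meant GiB, which would shift the bytes-per-token figure slightly.
DATASET_BYTES = 349 * 10**9
TOTAL_TOKENS = 90 * 10**9  # 90 billion tokens, seen once (single epoch)

# Parameter counts of the three model sizes named in the abstract.
MODEL_PARAMS = {"1.1B": 1.1e9, "3B": 3e9, "5B": 5e9}


def bytes_per_token(dataset_bytes: float, total_tokens: float) -> float:
    """Average number of raw bytes covered by one BPE token."""
    return dataset_bytes / total_tokens


def tokens_per_parameter(total_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter in one epoch."""
    return total_tokens / params


if __name__ == "__main__":
    ratio = bytes_per_token(DATASET_BYTES, TOTAL_TOKENS)
    print(f"bytes per token: {ratio:.2f}")
    for name, params in MODEL_PARAMS.items():
        tpp = tokens_per_parameter(TOTAL_TOKENS, params)
        print(f"{name}: {tpp:.1f} tokens per parameter")
```

Under these assumptions each token covers a little under 4 bytes of text on average, and the tokens-per-parameter ratio falls as model size grows because all three models see the same 90 billion tokens.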
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
