TL;DR
This paper introduces SinLlama, an open-source large language model adapted for Sinhala by extending the Llama-3-8B tokenizer with Sinhala vocabulary and continually pre-training the model on a cleaned Sinhala corpus. After instruction fine-tuning, it outperforms the base and instruct variants of Llama-3-8B on three text classification tasks.
Contribution
The paper presents the first open-source decoder-based LLM with explicit Sinhala support, built by extending the Llama-3-8B tokenizer with Sinhala-specific vocabulary and continually pre-training the model on a cleaned Sinhala corpus.
Findings
After instruction fine-tuning, SinLlama outperforms both the base and instruct variants of Llama-3-8B on three Sinhala text classification tasks.
Extending the tokenizer with Sinhala-specific vocabulary improves the model's handling of Sinhala text.
Continual pre-training on a cleaned Sinhala corpus boosts model performance.
Abstract
Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala-specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.
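As a rough illustration of what extending a tokenizer's vocabulary involves, the sketch below appends unseen tokens to a toy vocabulary and grows the embedding table to match. It is not the procedure used for SinLlama: the function names `extend_vocab` and `resize_embeddings`, the toy vocabulary, the assumption that token ids are contiguous from 0, and the choice to initialize new embedding rows at the mean of the existing rows are all assumptions made for illustration.

```python
def extend_vocab(vocab, new_tokens):
    """Return a copy of vocab with unseen tokens appended at the next free ids.

    Assumes existing ids are contiguous (0 .. len(vocab) - 1).
    """
    extended = dict(vocab)
    for token in new_tokens:
        if token not in extended:
            extended[token] = len(extended)
    return extended


def resize_embeddings(embeddings, new_size):
    """Grow the embedding table to new_size rows.

    New rows start at the mean of the existing rows (an illustrative
    initialization choice, not one taken from the paper).
    """
    dim = len(embeddings[0])
    count = len(embeddings)
    mean_row = [sum(row[d] for row in embeddings) / count for d in range(dim)]
    extra = [list(mean_row) for _ in range(new_size - count)]
    return [list(row) for row in embeddings] + extra


# Toy example: a three-token vocabulary with two-dimensional embeddings.
base_vocab = {"<unk>": 0, "a": 1, "b": 2}
base_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Hypothetical Sinhala tokens; "a" is already present and keeps its id.
sinhala_tokens = ["සිංහල", "a", "භාෂාව"]

new_vocab = extend_vocab(base_vocab, sinhala_tokens)
new_embeddings = resize_embeddings(base_embeddings, len(new_vocab))
```

The new embedding rows carry no language-specific information at this point, which is why a step such as the continual pre-training described in the abstract is needed before the added tokens become useful.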
