SinLlama -- A Large Language Model for Sinhala

H.W.K.Aravinda; Rashad Sirajudeen; Samith Karunathilake; Nisansa de Silva; Surangika Ranathunga; Rishemjit Kaur

arXiv:2508.09115·cs.CL·November 11, 2025

TL;DR

This paper introduces SinLlama, an open-source large language model for Sinhala built by extending the Llama-3-8B tokenizer with Sinhala-specific vocabulary and continually pre-training the model on a cleaned Sinhala corpus. When instruction fine-tuned on three text classification tasks, it outperforms the base and instruct variants of Llama-3-8B.

Contribution

The paper presents the first open-source decoder-based LLM with explicit Sinhala support, built by extending the Llama-3-8B tokenizer with Sinhala-specific vocabulary and continually pre-training the model on a cleaned Sinhala corpus.

Findings

01

After instruction fine-tuning, SinLlama outperforms both the base and instruct variants of Llama-3-8B on three Sinhala text classification tasks.

02

Extending the tokenizer with Sinhala-specific vocabulary improves the model's handling of Sinhala text.

03

Continual pre-training on a cleaned Sinhala corpus boosts model performance.

Abstract

Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala-specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed the base and instruct variants of Llama-3-8B by a significant margin.
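The tokenizer extension described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a greedy longest-match tokenizer with a UTF-8 byte fallback, and the helper names (`tokenize`, `extend_vocab`, `resize_embeddings`) and the mean-initialization of new embedding rows are hypothetical choices made only for illustration. The point it tries to show is why adding Sinhala-specific vocabulary matters: without dedicated tokens, each Sinhala character falls back to three UTF-8 bytes, and the embedding table must grow to cover the new token IDs before continual pre-training can learn them.

```python
from typing import Dict, List

# IDs 0..255 are reserved for raw UTF-8 bytes (an assumption of this toy
# tokenizer), so learned vocabulary entries start at ID 256.
BYTE_IDS = 256


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Greedy longest-match tokenization with a UTF-8 byte fallback."""
    max_len = max((len(tok) for tok in vocab), default=0)
    ids: List[int] = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # No entry matched: emit the UTF-8 bytes of a single character.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


def extend_vocab(vocab: Dict[str, int], new_tokens: List[str]) -> Dict[str, int]:
    """Return a copy of vocab with unseen tokens appended at fresh IDs."""
    extended = dict(vocab)
    next_id = max(extended.values(), default=BYTE_IDS - 1) + 1
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = next_id
            next_id += 1
    return extended


def resize_embeddings(emb: List[List[float]], new_size: int) -> List[List[float]]:
    """Grow the embedding table to new_size rows.

    New rows are set to the mean of the existing rows. This is one common
    heuristic, assumed here; the source does not state how new rows are
    initialized.
    """
    dim = len(emb[0])
    mean = [sum(row[d] for row in emb) / len(emb) for d in range(dim)]
    extra = max(new_size - len(emb), 0)
    return [row[:] for row in emb] + [mean[:] for _ in range(extra)]


if __name__ == "__main__":
    # The word "Sinhala" in Sinhala script, written with escapes (5 code points).
    word = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"
    base_vocab = {"the": 256, " ": 257}
    sinhala_vocab = extend_vocab(base_vocab, [word[:3], word[3:]])

    print(len(tokenize(word, base_vocab)))     # byte fallback only
    print(len(tokenize(word, sinhala_vocab)))  # uses the added tokens
```

With the base vocabulary the five-character word is split into 15 byte tokens, while the extended vocabulary covers it in 2 tokens. In a real pipeline the same two steps (add tokens, then resize the embedding matrix) would be done through the tokenizer and model library in use, followed by continual pre-training so the new rows acquire useful values.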

Code & Models

Models

polyglots/SinLlama_v01 (Hugging Face model; 824 downloads, 46 likes)
