SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali

Ranidu Gurusinghe; Nevidu Jayatilleke

arXiv:2603.29221 · cs.CL · April 1, 2026


2 Datasets

TL;DR

SiPaKosa is a large, high-quality corpus of Sinhala and Pali Buddhist texts created through OCR and web scraping, enabling advanced language modeling and cultural preservation.

Contribution

The paper introduces SiPaKosa, a comprehensive, annotated corpus of Sinhala and Pali texts, and evaluates language models on this new resource.

Findings

01

Proprietary models outperform open-source models by factors of three to six on perplexity (lower is better).

02

The corpus contains approximately 786,000 sentences and 9.25 million words.

03

Across the ten pretrained models evaluated, perplexity on the corpus ranges from 1.09 to 189.67.

Abstract

SiPaKosa is a comprehensive corpus of Sinhala and Pali doctrinal texts comprising approximately 786K sentences and 9.25M words, incorporating 16 copyright-cleared historical Buddhist documents alongside the complete web-scraped Tripitaka canonical texts. The corpus was created through high-quality OCR using Google Document AI on historical manuscripts, combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. The corpus is organised into language-specific subcorpora: Sinhala and Mixed Sinhala-Pali. We evaluate ten pretrained language models on the corpus and obtain perplexity scores ranging from 1.09 to 189.67. This analysis shows that proprietary models significantly outperform open-source alternatives, by factors of three to six. This corpus supports the pretraining of domain-adapted…
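For readers unfamiliar with the metric, the sketch below shows one common way perplexity is computed: the exponential of the mean negative log-likelihood per token. It is illustrative only. The abstract does not state the paper's exact protocol (tokenisation, normalisation unit, context length), and the function names `perplexity` and `perplexity_ratio` are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Assumption: normalisation is per token; the paper may normalise
    differently (for example per word or per character).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood over the sequence.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def perplexity_ratio(ppl_a: float, ppl_b: float) -> float:
    """How many times larger model A's perplexity is than model B's."""
    if ppl_b <= 0:
        raise ValueError("perplexity must be positive")
    return ppl_a / ppl_b


# A model that assigns probability 0.25 to every token behaves like a
# uniform choice among four options, so its perplexity is 4.
uniform_four = perplexity([math.log(0.25)] * 3)

# A model that predicts every token with probability 1 reaches the
# lower bound of 1.
perfect = perplexity([0.0, 0.0])
```

Because lower perplexity is better, a statement such as "outperforms by a factor of three" corresponds to `perplexity_ratio` returning about 3 when the weaker model is passed first.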

