SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali
Ranidu Gurusinghe, Nevidu Jayatilleke

TL;DR
SiPaKosa is a large, high-quality corpus of Sinhala and Pali Buddhist texts created through OCR and web scraping, enabling advanced language modeling and cultural preservation.
Contribution
The paper introduces SiPaKosa, a comprehensive, annotated corpus of Sinhala and Pali texts, and evaluates language models on this new resource.
Findings
Proprietary models outperform open-source models by a factor of three to six in perplexity.
The corpus contains approximately 786,000 sentences and 9.25 million words.
Ten pretrained models are evaluated, with perplexity scores ranging from 1.09 to 189.67.
Abstract
SiPaKosa is a comprehensive corpus of Sinhala and Pali doctrinal texts comprising approximately 786K sentences and 9.25M words, incorporating 16 copyright-cleared historical Buddhist documents alongside the complete web-scraped Tripitaka canonical texts. The corpus was created through high-quality OCR using Google Document AI on historical manuscripts, combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. The corpus is organised into language-specific subcorpora: Sinhala and Mixed Sinhala-Pali. We evaluate ten pretrained language models on the corpus, with perplexity scores ranging from 1.09 to 189.67. This analysis shows that proprietary models significantly outperform open-source alternatives, by factors of three to six. This corpus supports the pretraining of domain-adapted…
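For readers unfamiliar with the metric, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities, and how a "factor of N" comparison between two models can be read. It assumes the standard definition (the exponential of the mean negative log-likelihood). The summary above does not state the paper's exact protocol, such as its tokenization or normalization unit, so the function names and the toy numbers here are hypothetical illustrations, not the authors' implementation or results.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities (standard definition).

    Assumption: one log-probability per scored token. The paper's actual
    normalization unit (token vs. word) is not given in this summary.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def perplexity_ratio(ppl_a, ppl_b):
    """How many times higher model A's perplexity is than model B's.

    Lower perplexity is better, so a ratio above 1 favours model B.
    """
    return ppl_a / ppl_b


# Toy example with made-up probabilities (not data from the paper):
# a model that assigns probability 0.25 to every token has perplexity 4.
toy_log_probs = [math.log(0.25)] * 4
toy_ppl = perplexity(toy_log_probs)
```

Under this definition, a model that is perfectly certain about every token would reach the lower bound of 1, which is why values close to 1 indicate a very strong fit to the text.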
