Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala
Minuri Rajapakse, Ruvan Weerasinghe

TL;DR
This paper benchmarks 24 open-source language models on Sinhala across Unicode, Romanized, and mixed scripts, revealing significant script sensitivity and challenging assumptions about model size and robustness.
Contribution
It provides the first comprehensive evaluation of language models on Sinhala's diverse scripts, highlighting their script sensitivity and offering practical insights for low-resource language modeling.
Findings
Median performance degrades by a factor of more than 300 from Unicode to Romanized text.
Model size does not predict script-handling competence: smaller models often outperform architectures up to 28 times larger.
Unicode performance predicts mixed-script robustness but not Romanized capability.
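The two kinds of statistic behind these findings, a median degradation ratio and a measure of how well one script's scores predict another's, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the summary does not say which correlation measure was used, so Spearman rank correlation is swapped in here, and the model names and perplexity values are invented purely for demonstration.

```python
from statistics import median


def degradation_ratios(ppl_unicode, ppl_other):
    """Per-model ratio of perplexity on another script to Unicode perplexity."""
    return {m: ppl_other[m] / ppl_unicode[m] for m in ppl_unicode}


def ranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical perplexities for three made-up models (NOT results from the paper).
unicode_ppl = {"model_a": 10.0, "model_b": 20.0, "model_c": 40.0}
romanized_ppl = {"model_a": 5000.0, "model_b": 2000.0, "model_c": 8000.0}
mixed_ppl = {"model_a": 30.0, "model_b": 50.0, "model_c": 90.0}

models = sorted(unicode_ppl)
median_ratio = median(degradation_ratios(unicode_ppl, romanized_ppl).values())
rho_mixed = spearman([unicode_ppl[m] for m in models], [mixed_ppl[m] for m in models])
rho_roman = spearman([unicode_ppl[m] for m in models], [romanized_ppl[m] for m in models])
```

In this toy data the Unicode ranking carries over to mixed-script text exactly but only partly to Romanized text, which mirrors the shape of the third finding without reproducing any of its numbers.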
Abstract
The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality, with Unicode used in formal contexts and Romanized text dominating social media, while mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized and mixed-script Sinhala using perplexity evaluation across diverse text sources. Results reveal substantial script sensitivity, with median performance degradation exceeding 300 times from Unicode to Romanized text. Critically, model size shows no correlation with script-handling competence, as smaller models often outperform architectures 28 times larger. Unicode performance strongly predicts mixed-script robustness but not Romanized capability, demonstrating that…
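For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood a model assigns to each token, so lower is better. The sketch below shows the standard definition only. It assumes per-token natural-log probabilities have already been obtained from a model; the function name and inputs are illustrative, and the summary does not state how the authors tokenized text or normalized scores across models.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-(1/N) * sum(log p_i)) over the N tokens of a text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform_four = perplexity([math.log(0.25)] * 4)
```

Because the average is taken per token, scores from models with different tokenizers are not directly comparable without some normalization, which is worth keeping in mind when reading cross-model comparisons.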
