Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Romanized Scripts in a Real World Setting
Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi

TL;DR
This study evaluates how romanized Indian-language text affects LLM performance in maternal and newborn healthcare triage. It finds a consistent reliability gap relative to native-script text and proposes a routing method to mitigate it in real-world applications.
Contribution
The paper quantifies the impact of romanization on LLM triage accuracy in healthcare and introduces an Uncertainty-based Selective Routing method to improve reliability in multilingual settings.
Findings
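Romanized messages degrade performance by up to 24 points across languages and models.
The proposed Uncertainty-based Selective Routing method reduces the script gap and the triage errors it could cause.
At the partner maternal health organization alone, the gap could cause nearly 2 million excess errors without mitigation.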
Abstract
Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. Speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely quantifies or evaluates this orthographic variation in real-world applications. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated health queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with the gap reaching up to 24 points across languages and models. We propose and evaluate an Uncertainty-based Selective Routing method to close this script gap. At our partner maternal health organization alone, this gap could cause nearly 2 million excess…
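The abstract names an Uncertainty-based Selective Routing method but does not describe its mechanics here. The following is only a minimal sketch of the general idea of uncertainty-based routing, not the authors' implementation. It assumes the primary model returns a probability distribution over triage labels, uses Shannon entropy as the uncertainty score, and sends high-uncertainty messages to a fallback handler. The names `entropy`, `route`, `primary`, `fallback`, and the threshold value are all hypothetical.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical type: a classifier maps a message to label probabilities.
Classifier = Callable[[str], Dict[str, float]]


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a label distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def route(
    message: str,
    primary: Classifier,
    fallback: Callable[[str], str],
    threshold: float = 0.5,  # assumed value; would need tuning on held-out data
) -> Tuple[str, str]:
    """Return (label, path). Use the primary model's top label when its
    predictive entropy is at or below the threshold; otherwise defer to
    the fallback handler (for example, a second pass or a human reviewer)."""
    probs = primary(message)
    if entropy(probs) > threshold:
        return fallback(message), "fallback"
    return max(probs, key=probs.get), "primary"
```

In this sketch the fallback could be anything a deployment trusts more for uncertain inputs; which fallback the paper uses is not stated in the text above.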
