Prompting with Phonemes: Enhancing LLMs' Multilinguality for Non-Latin Script Languages
Hoang H Nguyen, Khyati Mahajan, Vikas Yadav, Julian Salazar, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary

TL;DR
This paper uses phonemic transcriptions as a complementary input to induce script-invariant representations in multilingual large language models, which improves performance on non-Latin script languages and narrows their gap with Latin script languages.
Contribution
The study proposes leveraging phonemic signals alongside orthographic scripts to enhance multilingual LLMs, especially for non-Latin scripts, and introduces a Mixed-ICL retrieval strategy for better in-context learning.
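As a rough illustration of what "phonemic signals alongside orthographic scripts" could look like at the prompt level, the sketch below places a phonemic transcription next to the original text. The function name `build_prompt`, the field labels, and the layout are hypothetical and are not the paper's template; the transcription is assumed to come from an external grapheme-to-phoneme tool.

```python
def build_prompt(task: str, text: str, phonemes: str) -> str:
    """Compose a prompt that shows both the written form and its transcription.

    Hypothetical layout for illustration only: the summary above does not
    specify the template the authors actually use.
    """
    return (
        f"{task}\n"
        f"Text: {text}\n"                          # orthographic script
        f"Phonemic transcription: {phonemes}\n"    # assumed precomputed (e.g. IPA)
        "Answer:"
    )


# Placeholder strings stand in for a real sentence and its transcription.
prompt = build_prompt(
    task="Classify the sentiment of the sentence as positive or negative.",
    text="<sentence in its native script>",
    phonemes="<phonemic transcription of the sentence>",
)
```

The point of the design is only that the model sees both views of the same input, so that words sharing pronunciation across scripts also share surface tokens.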
Findings
Phonemic signals improve non-Latin script language performance.
Mixed-ICL retrieval outperforms randomized retrieval.
Performance gains of up to 15.1% on non-Latin scripts.
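The idea behind mixing retrieval sources can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: character-bigram Jaccard similarity stands in for whatever retriever the paper uses, and the interleaving rule, the `Example` type, and all function names are hypothetical. It only shows how in-context examples ranked by orthographic similarity and by phonemic similarity could be combined into one set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str      # orthographic form
    phonemes: str  # phonemic transcription, assumed precomputed
    label: str


def char_ngrams(s: str, n: int = 2) -> set[str]:
    """Set of character n-grams, ignoring spaces."""
    s = s.replace(" ", "")
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams (stand-in similarity, an assumption)."""
    x, y = char_ngrams(a), char_ngrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def rank(query: str, pool: list[Example], key) -> list[Example]:
    """Sort the pool by similarity between the query and one view of each example."""
    return sorted(pool, key=lambda ex: jaccard(query, key(ex)), reverse=True)


def mixed_retrieve(query: Example, pool: list[Example], k: int = 4) -> list[Example]:
    """Interleave the orthographic and phonemic rankings, skipping duplicates."""
    by_text = rank(query.text, pool, lambda ex: ex.text)
    by_phon = rank(query.phonemes, pool, lambda ex: ex.phonemes)
    picked: list[Example] = []
    for a, b in zip(by_text, by_phon):
        for ex in (a, b):
            if ex not in picked and len(picked) < k:
                picked.append(ex)
    return picked
```

Because the two rankings can surface different examples, interleaving them yields a demonstration set that neither view would select on its own.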
Abstract
Although multilingual LLMs have achieved remarkable performance across benchmarks, we find they continue to underperform on non-Latin script languages across contemporary LLM families. This discrepancy arises from the fact that LLMs are pretrained with orthographic scripts, which are dominated by Latin characters that obscure their shared phonology with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin script languages, with a particularly significant impact on closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy,…
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Text Readability and Simplification
