One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models
Benedikt Ebing, Lennart Keller, Goran Glavaš

TL;DR
This study evaluates how romanization affects the pretraining of multilingual encoder models across typologically diverse languages. It finds minimal performance loss for languages with segmental scripts, some degradation for languages with morphosyllabic scripts, and encoding efficiency benefits.
Contribution
The paper analyzes the effects of romanization on high-resource languages during pretraining, addressing the open question of whether romanization is generally applicable and how it affects model performance.
Findings
Romanization causes negligible performance loss for languages with segmental scripts.
Degradation is observed for languages with morphosyllabic scripts; higher-fidelity romanization mitigates it.
Romanization improves encoding efficiency without significant performance costs.
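As a rough illustration of the encoding-efficiency point, the sketch below compares the UTF-8 byte length of a short Cyrillic string with that of its romanized form. It assumes that byte count is a reasonable proxy for encoding efficiency, which this summary does not confirm. The transliteration table `TOY_TABLE` and the helpers `romanize` and `utf8_len` are hypothetical and are not the romanizer or metric used in the paper.

```python
# Hypothetical toy transliteration table (Russian Cyrillic -> Latin).
# It covers only the letters in the example and is NOT the paper's romanizer.
TOY_TABLE = {
    "п": "p", "р": "r", "и": "i", "в": "v",
    "е": "e", "т": "t", "м": "m",
}


def romanize(text: str) -> str:
    """Map each character through the toy table; unknown characters pass through."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in text)


def utf8_len(text: str) -> int:
    """Number of bytes needed to store the text in UTF-8."""
    return len(text.encode("utf-8"))


original = "привет мир"
romanized = romanize(original)

print(romanized)            # romanized form of the example string
print(utf8_len(original))   # each Cyrillic letter takes 2 bytes in UTF-8
print(utf8_len(romanized))  # each ASCII letter takes 1 byte in UTF-8
```

In this toy case the romanized string needs roughly half the bytes of the original. A real romanizer must also handle many-to-one mappings, which is one place where script-specific information can be lost.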
Abstract
Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information…
Taxonomy
Topics: Natural Language Processing Techniques · Multilingual Education and Policy · Language and cultural evolution
