One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models

Benedikt Ebing; Lennart Keller; Goran Glavaš

arXiv:2601.05776 · cs.CL · January 12, 2026

TL;DR

This study evaluates the impact of romanization on pretraining multilingual encoder models across six typologically diverse high-resource languages. It finds minimal performance loss for segmental scripts, some degradation for morphosyllabic scripts, and encoding-efficiency benefits.

Contribution

The paper provides a comprehensive analysis of how romanization affects high-resource languages during pretraining, addressing gaps in understanding its general applicability and its impact on model performance.

Findings

01  Negligible performance loss for segmental scripts with romanization.

02  Degradation is observed for morphosyllabic scripts but is mitigated by higher-fidelity romanization.

03  Romanization improves encoding efficiency without significant performance costs.
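
As a rough illustration of why romanization can shorten encoded text, the sketch below compares UTF-8 byte lengths of a short Cyrillic string before and after a toy character mapping. It assumes that "encoding efficiency" can be read at the byte level; the paper may measure it differently (for example, in tokens per word). The table `TOY_TABLE` and the helpers `toy_romanize` and `utf8_len` are hypothetical names introduced here and are not the romanizer used by the authors.

```python
# Toy illustration only: a hand-written Cyrillic-to-Latin table,
# NOT the romanization scheme used in the paper.
TOY_TABLE = {
    "п": "p", "р": "r", "и": "i", "в": "v",
    "е": "e", "т": "t", "м": "m",
}


def toy_romanize(text: str) -> str:
    """Map each character through TOY_TABLE; unknown characters pass through unchanged."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in text)


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


original = "привет мир"
romanized = toy_romanize(original)

# Each Cyrillic letter takes 2 bytes in UTF-8, each ASCII letter takes 1.
print(romanized, utf8_len(original), utf8_len(romanized))
```

For this string the romanized form needs roughly half the bytes. The gain for a real model depends on the tokenizer and on the romanization scheme, neither of which is specified on this page.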

Abstract

Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information…

Taxonomy

Topics: Natural Language Processing Techniques · Multilingual Education and Policy · Language and cultural evolution