The Degree of Language Diacriticity and Its Effect on Tasks
Adi Cohen, Yuval Pinter

TL;DR
This paper introduces a data-driven, information-theoretic framework to quantify diacritic complexity across languages and examines its impact on diacritics restoration tasks using neural models.
Contribution
It provides the first cross-linguistic, corpus-based measurement of diacritic complexity and links these metrics to model performance in diacritic restoration.
Findings
Higher diacritic complexity correlates with lower restoration accuracy.
Structural complexity measures are more predictive in multi-diacritic scripts.
Frequency and structural measures align in single-diacritic scripts.
Abstract
Diacritics are orthographic marks that clarify pronunciation, distinguish similar words, or alter meaning. They play a central role in many writing systems, yet their impact on language technology has not been systematically quantified across scripts. While prior work has examined diacritics in individual languages, there is no cross-linguistic, data-driven framework for measuring how heavily writing systems rely on them and how this reliance affects downstream tasks. We propose a data-driven framework for quantifying diacritic complexity using corpus-level, information-theoretic metrics that capture the frequency, ambiguity, and structural diversity of character-diacritic combinations. We compute these metrics over 24 corpora in 15 languages, spanning both single- and multi-diacritic scripts. We then examine how diacritic complexity correlates with performance on the task of diacritics…
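To make the idea of corpus-level, information-theoretic diacritic metrics more concrete, here is a minimal Python sketch. It is not the authors' implementation, and the three quantities it reports (`marked_ratio`, `cond_entropy_bits`, `distinct_combos`) are hypothetical stand-ins chosen for illustration, loosely corresponding to frequency, ambiguity, and structural diversity. It assumes that diacritics can be identified as Unicode combining marks after NFD normalization, which misses letters such as "ø" or "ł" that do not decompose.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def split_clusters(text):
    """Yield (base, marks) pairs; marks is a tuple of combining marks following the base."""
    base, marks = None, []
    for ch in unicodedata.normalize("NFD", text):
        # Assumption: any character in a Unicode Mark category (Mn, Mc, Me) is a diacritic.
        if unicodedata.category(ch).startswith("M"):
            if base is not None:
                marks.append(ch)
        else:
            if base is not None:
                yield base, tuple(marks)
            base, marks = ch, []
    if base is not None:
        yield base, tuple(marks)


def diacritic_stats(text):
    """Illustrative (hypothetical) diacritic statistics over alphabetic base characters."""
    counts = defaultdict(Counter)  # base character -> Counter of mark tuples
    for base, marks in split_clusters(text):
        if base.isalpha():
            counts[base][marks] += 1

    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return {"marked_ratio": 0.0, "cond_entropy_bits": 0.0, "distinct_combos": 0}

    # Frequency: share of letters that carry at least one mark.
    marked = sum(n for c in counts.values() for m, n in c.items() if m)

    # Ambiguity: conditional entropy H(marks | base) in bits.
    entropy = 0.0
    for c in counts.values():
        n_base = sum(c.values())
        for n in c.values():
            p = n / n_base
            entropy -= (n_base / total) * p * math.log2(p)

    # Structural diversity: number of distinct (base, non-empty mark tuple) pairs.
    combos = sum(1 for c in counts.values() for m in c if m)

    return {
        "marked_ratio": marked / total,
        "cond_entropy_bits": entropy,
        "distinct_combos": combos,
    }


if __name__ == "__main__":
    print(diacritic_stats("cafe caf\u00e9"))
```

Because marks are collected as a tuple per base character, the same function covers single-diacritic scripts (at most one mark per letter) and multi-diacritic scripts (several stacked marks) without modification. The actual metrics and normalization choices in the paper may differ.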
