Don't Touch My Diacritics
Kyle Gorman, Yuval Pinter

TL;DR
This paper highlights the negative impact of inconsistent diacritic handling in NLP preprocessing and advocates for standardized practices to improve multilingual model performance and fairness.
Contribution
It emphasizes the importance of consistent diacritic processing in NLP and calls for community-wide adoption of better handling practices.
Findings
Inconsistent diacritic encoding harms model performance.
Removing diacritics can lead to loss of information.
Standardized diacritic handling improves multilingual NLP fairness.
Abstract
The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance. In this opinion piece, we focus on the handling of diacritics in texts originating in many languages and scripts. We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether. We call on the community to adopt simple but necessary steps across all models and toolkits in order to improve handling of diacritized text and, by extension, increase equity in multilingual NLP.
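The two failure modes named in the abstract, inconsistent encoding and outright removal of diacritics, can be illustrated with a minimal Python sketch. It is not taken from the paper: the example words and the helper names `normalize` and `strip_diacritics` are assumptions chosen for illustration, and only the standard-library `unicodedata` module is used.

```python
import unicodedata

# Two encodings of the same visible word "café" (illustrative example).
composed = "caf\u00e9"      # 'é' as a single precomposed code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# The strings look identical when rendered but differ as code point sequences.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 4 5


def normalize(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: map text to one Unicode normalization form."""
    return unicodedata.normalize(form, text)


# After normalization, both encodings compare equal.
print(normalize(composed) == normalize(decomposed))   # True


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose, then drop combining marks (lossy)."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


# Stripping merges distinct words: Spanish "papá" (dad) becomes "papa" (potato).
print(strip_diacritics("pap\u00e1") == "papa")   # True
```

Applying a single normalization form consistently makes visually identical strings compare equal, whereas stripping is irreversible and can merge words that the diacritic was distinguishing.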
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Focus
