Don't Touch My Diacritics

Kyle Gorman; Yuval Pinter

arXiv:2410.24140·cs.CL·February 20, 2025

TL;DR

This opinion piece shows, through case studies, how inconsistent encoding of diacritized characters and outright removal of diacritics during NLP preprocessing hurt models, and it advocates standardized handling to improve multilingual performance and equity.

Contribution

The paper argues that diacritics must be processed consistently in NLP pipelines and calls on the community to adopt simple handling steps across all models and toolkits.

Findings

01. Inconsistent diacritic encoding harms model performance.

02. Removing diacritics can lead to loss of information.

03. Standardized diacritic handling improves multilingual NLP fairness.

Abstract

The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance. In this opinion piece, we focus on the handling of diacritics in texts originating in many languages and scripts. We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether. We call on the community to adopt simple but necessary steps across all models and toolkits in order to improve handling of diacritized text and, by extension, increase equity in multilingual NLP.
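
The two failure modes named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: it uses only Python's standard `unicodedata` module, the helper names `normalize` and `strip_diacritics` are hypothetical, and the choice of NFC as the target form is an assumption made for illustration.

```python
import unicodedata


def normalize(text: str, form: str = "NFC") -> str:
    """Map text to one Unicode normalization form (NFC assumed here)."""
    return unicodedata.normalize(form, text)


def strip_diacritics(text: str) -> str:
    """Decompose, then drop combining marks. Lossy; shown only as a caution."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The same visible word, encoded two different ways.
composed = "caf\u00e9"      # 'é' as a single precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Inconsistent encoding: the strings look identical but compare unequal,
# so a tokenizer or vocabulary lookup would treat them as different inputs.
print(composed == decomposed)

# Normalizing both sides to one form removes the mismatch.
print(normalize(composed) == normalize(decomposed))

# Stripping diacritics collapses distinct words into one string.
print(strip_diacritics("r\u00e9sum\u00e9") == "resume")
```

Normalizing once at the start of a pipeline keeps visually identical strings from being treated as different inputs, while the stripping helper shows why removal loses information: after it runs, "résumé" and "resume" can no longer be told apart.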

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Focus