Correcting diacritics and typos with a ByT5 transformer model
Lukas Stankevičius, Mantas Lukoševičius, Jurgita Kapočiūtė-Dzikienė, Monika Briedienė, Tomas Krilavičius

TL;DR
This paper presents a universal byte-level transformer model that simultaneously restores diacritics and corrects typos across multiple languages, outperforming traditional methods and demonstrating high accuracy with less data.
Contribution
The study introduces a single ByT5 transformer approach for combined diacritics restoration and typo correction, applicable to multiple languages without language-specific modifications.
Findings
Achieves over 98% accuracy in diacritics restoration on benchmark datasets.
Restores diacritics in unseen words with over 76% accuracy.
Reaches over 94% alpha-word accuracy in combined diacritics and typo correction across 13 languages.
Abstract
Due to the fast pace of life and online communications and the prevalence of English and the QWERTY keyboard, people tend to forgo using diacritics and make typographical errors (typos) when typing in other languages. Restoring diacritics and correcting spelling is important for proper language use and the disambiguation of texts for both humans and downstream algorithms. However, these two problems are typically addressed separately: the state-of-the-art diacritics restoration methods do not tolerate other typos, and classical spellcheckers cannot deal adequately with text in which all the diacritics are missing. In this work, we tackle both problems at once by employing the newly developed universal ByT5 byte-level seq2seq transformer model that requires no language-specific model structures. For comparison, we perform diacritics restoration on benchmark datasets of 12 languages, with the…
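To make the task setup in the abstract concrete, the following minimal sketch shows how an undiacritized input could be derived from correct text, and what a byte-level view of both strings looks like. It is an illustration only, not the authors' data pipeline: the helper names `strip_diacritics` and `to_byte_values` are hypothetical, the stripping relies on Unicode NFD decomposition (which misses letters such as "ł" that have no decomposition), and the raw UTF-8 byte values shown are not necessarily the token ids the ByT5 tokenizer assigns.

```python
import unicodedata
from typing import List


def strip_diacritics(text: str) -> str:
    """Drop combining marks after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_byte_values(text: str) -> List[int]:
    """Return the raw UTF-8 byte values of the text (hypothetical helper)."""
    return list(text.encode("utf-8"))


# A correctly written word serves as the target; the stripped form is the input.
target = "Kapočiūtė"
source = strip_diacritics(target)

print(source)                       # undiacritized input
print(len(to_byte_values(source)))  # one byte per ASCII letter
print(len(to_byte_values(target)))  # diacritized letters take two bytes each
```

A byte-level model sees both strings as sequences drawn from the same small vocabulary of byte values, which is one reason no language-specific vocabulary or structure is needed.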
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
