A Multitask Learning Approach for Diacritic Restoration
Sawsan Alqahtani, Ajay Mishra, Mona Diab

TL;DR
This paper presents a multitask learning approach for diacritic restoration in Arabic, jointly optimizing related NLP tasks to improve accuracy without relying on complex morphological analyzers.
Contribution
It introduces a novel multitask learning framework that enhances diacritic restoration by jointly modeling word segmentation, POS tagging, and syntactic diacritization.
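One common way to train several tasks jointly is to minimize a weighted sum of the per-task losses. The sketch below illustrates that idea only. The summary does not say how the paper combines its task losses, so the weighted sum, the task names, the weights, and the toy probability vectors are all assumptions made for illustration.

```python
import math


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold label under a predicted distribution."""
    return -math.log(probs[gold_index])


def joint_loss(task_losses, task_weights):
    """Weighted sum of per-task losses (weights are hypothetical hyperparameters)."""
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


# Toy predicted distributions for one input position, one per task (made-up numbers).
losses = {
    "diacritics": cross_entropy([0.7, 0.2, 0.1], 0),
    "segmentation": cross_entropy([0.6, 0.4], 0),
    "pos": cross_entropy([0.5, 0.3, 0.2], 1),
    "syntactic_diacritics": cross_entropy([0.8, 0.2], 0),
}
weights = {name: 1.0 for name in losses}  # equal weighting, an assumption

total = joint_loss(losses, weights)
print(f"joint loss: {total:.4f}")
```

With equal weights the joint loss is simply the sum of the four task losses. Lowering the weight of an auxiliary task reduces its influence on the shared parameters.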
Findings
Joint models outperform baseline methods.
The joint models achieve performance comparable to more complex state-of-the-art models.
Effective in low-resource or dialectal data scenarios.
Abstract
In many languages, such as Arabic, diacritics specify both pronunciation and meaning. They are often omitted in written text, which increases the number of possible pronunciations and meanings for a word. The resulting ambiguity makes computational processing of the text more difficult. Diacritic restoration is the task of restoring the missing diacritics in written text. Most state-of-the-art diacritic restoration models are built on character-level information, which helps them generalize to unseen data but presumably loses useful information at the word level. To compensate for this loss, we investigate the use of multi-task learning to jointly optimize diacritic restoration with related NLP problems, namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data…
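To make the ambiguity concrete, the following sketch strips Arabic diacritics from two differently vocalized words and shows that both collapse to the same undiacritized string. It also pairs each base character with its diacritics, which is one plausible way to frame restoration as character-level labeling. The function names and the chosen set of marks are assumptions for illustration, not the paper's preprocessing.

```python
# Arabic diacritic marks U+064B..U+0652 (fathatan through sukun),
# plus the superscript alef U+0670. This set is an illustrative assumption.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving base letters untouched."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def to_char_labels(diacritized):
    """Pair each base character with the diacritics that immediately follow it."""
    pairs = []
    for ch in diacritized:
        if ch in ARABIC_DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


alima = "\u0639\u064E\u0644\u0650\u0645\u064E"  # "he knew"
ilm = "\u0639\u0650\u0644\u0652\u0645"          # "knowledge"

# Both vocalized forms reduce to the same three base letters.
print(strip_diacritics(alima) == strip_diacritics(ilm))
print(to_char_labels(ilm))
```

A restoration model sees only the stripped string and must predict the label attached to each character.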
