Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging
Nasser Zalmout, Nizar Habash

TL;DR
This paper presents a joint modeling approach for diacritization, lemmatization, normalization, and morphological tagging in Semitic languages, with Arabic as the test case and accuracy gains on both Modern Standard Arabic and Egyptian Arabic.
Contribution
The paper introduces a single model that jointly predicts lexicalized features (modeled at the character level) and non-lexicalized features, despite their different modeling granularities, and reports state-of-the-art results for Arabic.
Findings
20% relative error reduction for Modern Standard Arabic
11% relative error reduction for Egyptian Arabic
Effective joint modeling of multiple morphological features
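For readers unfamiliar with the metric, relative error reduction measures what fraction of the baseline's errors the new system removes. A minimal sketch follows; the function name and the accuracy values are hypothetical, chosen only to illustrate the arithmetic, and are not figures from the paper.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Return the fraction of the baseline's errors removed by the new system."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    if baseline_err <= 0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_err - new_err) / baseline_err


# Hypothetical accuracies (NOT from the paper): 90% baseline, 92% new system.
# Errors drop from 10% to 8%, i.e. one fifth of the baseline errors are removed.
rer = relative_error_reduction(0.90, 0.92)
print(f"{rer:.0%}")
```

Note that a fixed relative reduction corresponds to a smaller absolute gain when the baseline is already accurate, so the two percentages above are not directly comparable as absolute accuracy improvements.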
Abstract
Semitic languages can be highly ambiguous, having several interpretations of the same surface forms, and morphologically rich, having many morphemes that realize several morphological features. This is further exacerbated for dialectal content, which is more prone to noise and lacks a standard orthography. The morphological features can be lexicalized, like lemmas and diacritized forms, or non-lexicalized, like gender, number, and part-of-speech tags, among others. Joint modeling of the lexicalized and non-lexicalized features can identify more intricate morphological patterns, which provide better context modeling, and further disambiguate ambiguous lexical choices. However, the different modeling granularity can make joint modeling more difficult. Our approach models the different features jointly, whether lexicalized (on the character-level), where we also model surface form…
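The distinction between lexicalized and non-lexicalized features, and the way tags can narrow down lexical choices, can be illustrated with a small data-structure sketch. This is not the paper's model: the `Analysis` class, the `filter_by_tags` helper, the tag values, and the simplified romanized forms are all assumptions made for illustration, using the well-known ambiguity of the undiacritized Arabic string "ktb".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    # Lexicalized features: open-vocabulary strings (hypothetical romanization).
    diacritized: str
    lemma: str
    # Non-lexicalized features: values from small closed tag sets (a subset only).
    pos: str
    number: str


# Two illustrative readings of the same undiacritized surface form "ktb".
candidates = [
    Analysis(diacritized="kataba", lemma="katab", pos="verb", number="singular"),
    Analysis(diacritized="kutub", lemma="kitAb", pos="noun", number="plural"),
]


def filter_by_tags(cands, **tags):
    """Keep only the analyses whose non-lexicalized tags match the given values."""
    return [c for c in cands if all(getattr(c, k) == v for k, v in tags.items())]


# If context suggests a noun, the lexical choice (diacritization, lemma) follows.
picked = filter_by_tags(candidates, pos="noun")
print([(c.diacritized, c.lemma) for c in picked])
```

The sketch only shows why the two feature groups inform each other: fixing a tag such as part of speech removes lexical candidates, and a confident lexical choice would likewise fix the tags.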
