TL;DR
This paper frames Arabic lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, introduces a new test set spanning diverse genres, and shows that classification and clustering approaches produce more robust and interpretable outputs than character-level sequence-to-sequence models.
Contribution
It introduces two novel approaches that treat Arabic lemmatization as classification into an LPG tagset, a new multi-genre test set standardized alongside existing datasets, and an evaluation showing that these approaches are more robust and interpretable than sequence-to-sequence models.
Findings
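Classification and clustering methods yield more robust and interpretable outputs than sequence-to-sequence models.
Character-level sequence-to-sequence models perform competitively, but they predict only the lemma (not the full LPG tag) and are prone to hallucinating implausible forms.
The results set new benchmarks for Arabic lemmatization across multiple genres.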
Abstract
Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.
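To make the classification framing concrete, the sketch below shows why choosing among a closed set of LPG tags cannot hallucinate a form, unlike free character generation. It is a toy illustration, not the paper's models: the `LPG` class, the `LEXICON` entries (in Buckwalter-style transliteration), and the gloss-overlap scorer in `classify` are all hypothetical, and matching glosses against an English translation is only loosely inspired by the abstract's mention of machine translation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LPG:
    """A Lemma-POS-Gloss tag (hypothetical container, not the paper's schema)."""
    lemma: str
    pos: str
    gloss: str


# Hypothetical toy lexicon: the undiacritized form "ktb" is ambiguous
# between the verb kataba ("write") and the noun kitAb ("book").
LEXICON = {
    "ktb": [LPG("kataba", "VERB", "write"), LPG("kitAb", "NOUN", "book")],
}


def classify(word: str, translation: str) -> Optional[LPG]:
    """Pick one LPG tag from the closed candidate set for `word`.

    Assumption: an English translation of the context is available, and a
    candidate is scored by how many translation tokens start with its gloss.
    """
    candidates = LEXICON.get(word)
    if not candidates:
        return None  # no candidates: abstain rather than invent a lemma
    tokens = translation.lower().split()

    def score(tag: LPG) -> int:
        return sum(tok.startswith(tag.gloss) for tok in tokens)

    # The result is always a member of `candidates`, so it carries a
    # lemma, a POS, and a gloss, and is never an implausible string.
    return max(candidates, key=score)


if __name__ == "__main__":
    print(classify("ktb", "he writes a letter"))
    print(classify("ktb", "the books are new"))
```

Because the output is always one of the listed candidates, every prediction comes with a POS and a gloss that a reader can inspect, which is one way to read the abstract's claim about interpretability.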