Comparison of Current Approaches to Lemmatization: A Case Study in Estonian
Aleksei Dorkin, Kairit Sirts

TL;DR
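This paper compares three approaches to Estonian lemmatization: generative character-level models, pattern-based word-level classification, and rule-based morphological analysis. The generative model outperforms the pattern-based model, and the small overlap in errors across the three approaches suggests that an ensemble could improve accuracy.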
Contribution
It provides a comparative analysis of three lemmatization approaches for Estonian, showing that a significantly smaller generative model outperforms a pattern-based classifier built on EstBERT, and pointing to the potential of ensemble methods.
Findings
A significantly smaller generative model consistently outperforms the pattern-based classification model based on EstBERT.
The errors made by the three models overlap relatively little, suggesting that an ensemble of the approaches could improve accuracy.
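The error-overlap finding can be illustrated with a minimal sketch. It assumes each system's predicted lemmas are available as a list aligned with the gold lemmas. The function names, system names, and toy data below are hypothetical and are not taken from the paper. The "oracle" figure is the accuracy an ideal ensemble would reach if it always picked a correct system whenever at least one exists, so it is an upper bound rather than a result.

```python
def error_sets(gold, predictions):
    """Map each system name to the set of token indices it got wrong."""
    return {
        name: {i for i, (g, p) in enumerate(zip(gold, preds)) if g != p}
        for name, preds in predictions.items()
    }


def overlap_report(gold, predictions):
    """Per-system accuracy, errors shared by all systems, and the oracle bound."""
    errors = error_sets(gold, predictions)
    shared = set.intersection(*errors.values())  # tokens every system misses
    n = len(gold)
    return {
        "accuracy": {name: (n - len(e)) / n for name, e in errors.items()},
        "shared_errors": len(shared),
        "oracle_accuracy": (n - len(shared)) / n,
    }


# Toy, hand-made data for illustration only (not results from the paper).
# Gold lemmas for the forms: majad, oli, lapsed, läks, raamatud.
gold = ["maja", "olema", "laps", "minema", "raamat"]
predictions = {
    "generative": ["maja", "olema", "laps", "läkma", "raamat"],
    "pattern": ["maja", "olema", "lapse", "läkma", "raamat"],
    "rule_based": ["maja", "olema", "laps", "minema", "raamatu"],
}

report = overlap_report(gold, predictions)
print(report)
```

In this toy setup no token is missed by all three systems, so the oracle bound is higher than any single system's accuracy. This is the kind of gap that motivates trying an ensemble.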
Abstract
This study evaluates three different lemmatization approaches to Estonian -- Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in errors made by all three models, indicating that an ensemble of different approaches could lead to improvements.
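To make the pattern-based idea concrete, here is a minimal sketch of lemmatization as classification over transformation rules. It assumes a simple suffix rule of the form "strip k characters, then append a string"; the abstract does not specify the rule format the paper uses, so `suffix_rule` and `apply_rule` are hypothetical helpers. In this framing a classifier predicts a rule label per word, whereas a generative model produces the lemma character by character.

```python
import os


def suffix_rule(form, lemma):
    """Derive a (strip_count, append_string) rule that turns form into lemma."""
    prefix_len = len(os.path.commonprefix([form, lemma]))
    return (len(form) - prefix_len, lemma[prefix_len:])


def apply_rule(form, rule):
    """Apply a suffix rule: drop strip_count trailing characters, then append."""
    strip_count, append = rule
    return form[: len(form) - strip_count] + append


# Regular plural forms share one rule, so a classifier can learn it as a label.
rule = suffix_rule("raamatud", "raamat")
print(rule)
print(apply_rule("lapsed", rule))

# A suppletive form shares no prefix with its lemma, so the rule degenerates
# into replacing the whole word and does not generalize to other words.
print(suffix_rule("läks", "minema"))
```

The last call shows one limitation of suffix rules: for irregular forms the rule is specific to a single word, which is one reason a fixed rule inventory may miss cases that a character-level generator or a morphological analyzer can handle.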
Taxonomy
Topics: Natural Language Processing Techniques · Second Language Acquisition and Learning
