Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

Aleksei Dorkin; Kairit Sirts

arXiv:2404.15003·cs.CL·April 24, 2024

Open Access

TL;DR

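This paper compares three approaches to Estonian lemmatization: generative character-level models, pattern-based word-level classification, and rule-based morphological analysis. A significantly smaller generative model consistently outperforms the EstBERT-based pattern classifier, and the small overlap in errors across the three suggests that an ensemble could improve accuracy.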

Contribution

The paper provides a comparative analysis of three lemmatization approaches for Estonian, showing that a generative character-level model outperforms a larger EstBERT-based pattern classifier and that the approaches' largely distinct errors make ensemble methods promising.

Findings

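01. Generative character-level models outperform the pattern-based classification model in Estonian lemmatization.

02. The errors made by the three models overlap relatively little, which suggests that an ensemble could improve accuracy.

03. The generative model performs better consistently across experiments despite being significantly smaller than the EstBERT-based pattern classifier.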

Abstract

This study evaluates three different lemmatization approaches to Estonian -- Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in errors made by all three models, indicating that an ensemble of different approaches could lead to improvements.
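The error-overlap observation can be made concrete with a minimal sketch. It assumes each system's lemma predictions are aligned token by token with the gold lemmas. The function names (`error_indices`, `error_overlap`) and the toy data are hypothetical illustrations, not the authors' evaluation code or results. The "oracle" figure is only an upper bound for an ensemble that always selects a correct system whenever one exists.

```python
def error_indices(gold, predicted):
    """Return the token positions where the predicted lemma differs from gold."""
    return {i for i, (g, p) in enumerate(zip(gold, predicted)) if g != p}


def error_overlap(gold, outputs):
    """Summarize per-system accuracy and how much the systems' errors overlap."""
    n = len(gold)
    errors = {name: error_indices(gold, preds) for name, preds in outputs.items()}
    shared = set.intersection(*errors.values())  # tokens every system gets wrong
    union = set.union(*errors.values())          # tokens at least one system gets wrong
    return {
        "accuracy": {name: (n - len(e)) / n for name, e in errors.items()},
        "shared_errors": len(shared),
        "union_errors": len(union),
        # Upper bound: an ensemble that picks a correct system whenever one exists.
        "oracle_accuracy": (n - len(shared)) / n,
    }


# Toy data invented for illustration only (not results from the paper).
gold = ["koer", "maja", "minema", "olema", "raamat"]
outputs = {
    "generative": ["koer", "maja", "minema", "olema", "raama"],
    "pattern":    ["koer", "maj", "minema", "olema", "raama"],
    "rule_based": ["koer", "maja", "mine", "olema", "raamat"],
}

report = error_overlap(gold, outputs)
print(report)
```

In this toy case no token is missed by all three systems, so the oracle bound exceeds every individual accuracy. That gap is the kind of headroom a small error overlap points to, although a real ensemble would still need a way to choose among the systems' outputs.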

Taxonomy

Topics: Natural Language Processing Techniques · Second Language Acquisition and Learning