GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian
Aleksei Dorkin, Kairit Sirts

TL;DR
GliLem is a hybrid lemmatization system for Estonian that combines rule-based analysis with a neural disambiguation module, significantly improving lemmatization accuracy and downstream information retrieval performance.
Contribution
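The paper introduces GliLem, a hybrid lemmatizer that pairs the rule-based morphological analyzer Vabamorf with a disambiguation module built on GliNER, an open vocabulary NER model. This improves Vabamorf's lemmatization accuracy by 10% compared to its original disambiguation module.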
Findings
Lemmatization accuracy improved by 10% over Vabamorf's original disambiguation module with GliNER integration.
Enhanced lemmatization leads to better IR recall, especially at high k.
Benchmarking shows significant IR metric improvements over stemming.
Abstract
We present GliLem -- a novel hybrid lemmatization system for Estonian that enhances the highly accurate rule-based morphological analyzer Vabamorf with an external disambiguation module based on GliNER -- an open vocabulary NER model that is able to match text spans with text labels in natural language. We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf by 10% compared to its original disambiguation module and achieve an improvement over the token classification-based baseline. To measure the impact of improvements in lemmatization accuracy on the information retrieval downstream task, we first created an information retrieval dataset for Estonian by automatically translating the DBpedia-Entity dataset from English. We benchmark several token normalization approaches, including lemmatization, on the created dataset using the BM25…
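The hybrid design described in the abstract (a rule-based analyzer proposes candidate lemmas, and a separate module picks one using context) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: `disambiguate`, `TOY_CANDIDATES`, and `toy_score` are hypothetical names, the candidate table is a hand-written stand-in for the analyzer's output, and the scoring rule is a toy heuristic standing in for a learned model such as GliNER.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer interface: (tokens, position, candidate lemma) -> score.
# In a real system this role would be played by a learned model.
Scorer = Callable[[Sequence[str], int, str], float]


def disambiguate(
    tokens: Sequence[str],
    candidates: Dict[str, List[str]],
    score: Scorer,
) -> List[str]:
    """Pick one lemma per token from the analyzer's candidate list."""
    lemmas: List[str] = []
    for i, token in enumerate(tokens):
        # Fall back to the lowercased token when no analysis is available.
        options = candidates.get(token.lower(), [token.lower()])
        if len(options) == 1:
            # Unambiguous form: nothing to disambiguate.
            lemmas.append(options[0])
        else:
            # Ambiguous form: keep the candidate the scorer ranks highest.
            lemmas.append(max(options, key=lambda c: score(tokens, i, c)))
    return lemmas


# Toy stand-in for rule-based analyzer output (assumption, not real Vabamorf output).
# "teed" is ambiguous between the noun "tee" and the verb "tegema".
TOY_CANDIDATES: Dict[str, List[str]] = {
    "teed": ["tee", "tegema"],
    "on": ["olema"],
    "libedad": ["libe"],
    "tööd": ["töö"],
}


def toy_score(tokens: Sequence[str], i: int, candidate: str) -> float:
    """Toy heuristic: prefer a verb lemma (ending in -ma) right after 'sa'."""
    previous = tokens[i - 1].lower() if i > 0 else ""
    is_verb = candidate.endswith("ma")
    if previous == "sa":
        return 1.0 if is_verb else 0.0
    return 0.0 if is_verb else 1.0


if __name__ == "__main__":
    print(disambiguate(["Sa", "teed", "tööd"], TOY_CANDIDATES, toy_score))
    print(disambiguate(["Teed", "on", "libedad"], TOY_CANDIDATES, toy_score))
```

Keeping the scorer behind a plain callable means the candidate generator and the disambiguator can be swapped independently, which is the property the hybrid setup relies on: the rule-based component constrains the output space, and the contextual component only has to choose within it.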
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
