On the Role of Morphological Information for Contextual Lemmatization

Olia Toporkov; Rodrigo Agerri

arXiv:2302.00407 · cs.CL · October 23, 2023 · 1 citation

TL;DR

This study empirically investigates the impact of explicit morphological information on contextual lemmatization across six languages, finding that modern models often perform well without such features and that current evaluation methods may be insufficient.

Contribution

The paper challenges the assumption that detailed morphological features improve lemmatization. It shows that modern contextual embeddings encode enough information to perform well without them, and it highlights issues with current evaluation practices.

Findings

01. Morphological features have limited impact on lemmatization performance.

02. Simple UPOS tags can be as effective as detailed morphological features.

03. Current evaluation practices may not adequately differentiate model performance.

Abstract

Lemmatization is a natural language processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. It is one of the basic tasks that facilitate downstream NLP applications, and it is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information when training contextual lemmatizers has become common practice, without considering whether that is optimal in terms of downstream performance. To address this issue, in this paper we empirically investigate the role of morphological information in developing contextual lemmatizers for six languages within a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and…
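To make the task definition concrete, the following is a minimal sketch of why the morphosyntactic category of a word in context can determine its lemma. It is a toy illustration only, not the models studied in the paper: the `TOY_LEXICON` table, the `lemmatize` function, and the English examples are all hypothetical, and the fallback of returning the lowercased word form is an assumption made for brevity.

```python
# Toy illustration (hypothetical data and names, not the paper's method):
# the same surface form maps to different lemmas depending on its UPOS tag.
TOY_LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}


def lemmatize(word: str, upos: str) -> str:
    """Return the lemma for a (word, UPOS) pair.

    Falls back to the lowercased word form when the pair is not in the
    toy lexicon (an assumption made to keep the sketch short).
    """
    key = (word.lower(), upos)
    return TOY_LEXICON.get(key, word.lower())


if __name__ == "__main__":
    # "saw" as a verb ("She saw the bird") vs. as a noun ("a rusty saw").
    print(lemmatize("saw", "VERB"))
    print(lemmatize("saw", "NOUN"))
```

A contextual lemmatizer has to resolve this kind of ambiguity from the surrounding sentence; the question the paper examines is whether supplying explicit tags or features during training actually helps.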

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification