Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources

Kirill Milintsevich; Kairit Sirts

arXiv:2101.12056 · cs.CL · November 16, 2022

TL;DR

This paper introduces a hybrid neural lemmatizer that augments a seq2seq model with lemma candidates from an external lexicon or a rule-based system, yielding statistically significant accuracy gains over baseline models that lack such candidates.

Contribution

The authors develop a seq2seq lemmatizer that learns both to generate lemmas and to copy lemma characters from externally supplied candidates, reaching an average accuracy 0.55% higher than the Stanford Stanza model on 23 UD languages.

Findings

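01 Achieves 97.25% average accuracy on 23 UD languages when enhanced with candidates from the Apertium morphological analyzer.

02 Achieves statistically significant improvements over baseline models that use no additional lemma information.

03 Performs considerably better than a simple lexicon-based alternative for integrating external data into lemmatization.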

Abstract

We propose a novel hybrid approach to lemmatization that enhances the seq2seq neural model with additional lemmas extracted from an external lexicon or a rule-based system. During training, the enhanced lemmatizer learns both to generate lemmas via a sequential decoder and to copy lemma characters from the external candidates supplied at run-time. Our lemmatizer, enhanced with candidates extracted from the Apertium morphological analyzer, achieves statistically significant improvements over baseline models that do not use additional lemma information. It reaches an average accuracy of 97.25% on a set of 23 UD languages, which is 0.55% higher than the Stanford Stanza model obtains on the same set of languages. We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon…
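The generate-or-copy idea described in the abstract can be illustrated with a small sketch. This is not the authors' architecture; it assumes a pointer-generator style mixture, which is one common way to combine a decoder's character distribution with attention over an external candidate. All names here (`mix_generate_and_copy`, `p_gen`, the toy scores and the candidate "run") are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_generate_and_copy(vocab, gen_scores, candidate_chars, copy_scores, p_gen):
    """Combine a generation distribution with a copy distribution.

    vocab           -- characters the decoder can generate
    gen_scores      -- decoder logits, one per vocab character
    candidate_chars -- characters of one external lemma candidate
    copy_scores     -- attention logits, one per candidate character
    p_gen           -- assumed gate in [0, 1]: probability of generating
                       rather than copying (hypothetical, not from the paper)
    """
    gen_probs = softmax(gen_scores)
    copy_attn = softmax(copy_scores)

    # Generation mass, scaled by the gate.
    final = {ch: p_gen * p for ch, p in zip(vocab, gen_probs)}

    # Copy mass: each candidate position adds its attention weight to the
    # character found there, including characters outside the vocabulary.
    for ch, weight in zip(candidate_chars, copy_attn):
        final[ch] = final.get(ch, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy decoding step: the decoder is unsure, but attention points at the
# last character of a hypothetical lexicon candidate "run".
vocab = ["a", "e", "n", "r", "u"]
gen_scores = [0.1, 0.3, 0.2, 0.1, 0.0]
candidate = list("run")
copy_scores = [0.0, 0.0, 3.0]

dist = mix_generate_and_copy(vocab, gen_scores, candidate, copy_scores, p_gen=0.3)
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

With a low `p_gen`, most of the probability mass comes from the candidate, so a reliable external lemma can override an uncertain decoder; with a high `p_gen`, the model falls back to plain generation when no useful candidate is available.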

Code & Models

Repositories

501Good/lexicon-enhanced-lemmatization
PyTorch · Official

Taxonomy

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence