TL;DR
This paper introduces a hybrid neural lemmatization approach that augments a seq2seq model with lemma candidates from an external lexicon or rule-based system, yielding statistically significant accuracy gains over baseline models that lack such candidates.
Contribution
The authors develop a novel seq2seq lemmatizer that learns both to generate lemmas with a sequential decoder and to copy lemma characters from externally supplied candidates. On 23 UD languages it scores 0.55% higher in average accuracy than the Stanford Stanza model.
Findings
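Achieves 97.25% average accuracy on 23 UD languages when enhanced with candidates from the Apertium morphological analyzer.
Shows statistically significant improvements over baseline models that use no additional lemma information.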
Demonstrates the effectiveness of combining external lexicons with neural models.
Abstract
We propose a novel hybrid approach to lemmatization that enhances the seq2seq neural model with additional lemmas extracted from an external lexicon or a rule-based system. During training, the enhanced lemmatizer learns both to generate lemmas via a sequential decoder and to copy the lemma characters from the external candidates supplied during run-time. Our lemmatizer enhanced with candidates extracted from the Apertium morphological analyzer achieves statistically significant improvements compared to baseline models not utilizing additional lemma information, and it reaches an average accuracy of 97.25% on a set of 23 UD languages, which is 0.55% higher than that obtained with the Stanford Stanza model on the same set of languages. We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon…
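The abstract describes a decoder that can either generate a character or copy one from an external candidate lemma, but it does not spell out how the two are combined. As a rough illustration only, the sketch below assumes a pointer-generator-style mixture: a gate `p_gen` weights the decoder's vocabulary distribution against an attention distribution over the candidate's characters. The function names, the gate, and the scores are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_generate_and_copy(vocab, gen_scores, candidate_chars, attn_scores, p_gen):
    """Combine a generation and a copy distribution for one decoding step.

    vocab:            output characters the decoder can generate
    gen_scores:       decoder logits, one per vocab character
    candidate_chars:  characters of an external candidate lemma
    attn_scores:      attention logits, one per candidate character
    p_gen:            assumed gate in [0, 1]; probability of generating
    Returns a dict mapping character -> probability.
    """
    gen_probs = softmax(gen_scores)
    attn_probs = softmax(attn_scores)

    # Generation part: scale the vocabulary distribution by p_gen.
    final = {ch: p_gen * p for ch, p in zip(vocab, gen_probs)}

    # Copy part: add attention mass onto the characters of the candidate.
    # Characters absent from the vocabulary can still receive probability.
    for ch, a in zip(candidate_chars, attn_probs):
        final[ch] = final.get(ch, 0.0) + (1.0 - p_gen) * a
    return final


# Hypothetical single step: the decoder is undecided, the candidate favors "c".
step = mix_generate_and_copy(
    vocab=["a", "b", "c"],
    gen_scores=[0.0, 0.0, 0.0],
    candidate_chars=["c", "c"],
    attn_scores=[1.0, 1.0],
    p_gen=0.5,
)
print(max(step, key=step.get))
```

In this toy step the copy distribution breaks the tie left by the uniform decoder scores, which is the intuition behind supplying external candidates at run-time. In a trained model the logits and the gate would be produced by the network rather than set by hand.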
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
