# Improving Lemmatization of Non-Standard Languages with Joint Learning

**Authors:** Enrique Manjavacas, Ákos Kádár, Mike Kestemont

arXiv: 1903.06939 · 2019-03-19

## TL;DR

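This paper treats lemmatization as string transduction with an encoder-decoder model enriched with sentence context, and trains the sentence encoder jointly for lemmatization and language modeling. The approach improves lemmatization of non-standard historical languages over previous state-of-the-art systems without requiring POS or morphological annotations.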

## Contribution

The paper presents a joint training method for lemmatization and language modeling that improves performance on non-standard languages without relying on POS or morphological annotations.
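As a rough illustration of what joint training can look like, the sketch below combines a lemmatization loss and a language-modeling loss into a single objective. It is a minimal sketch under stated assumptions, not the authors' implementation: the weighted-sum form, the `lm_weight` parameter, and the function names are hypothetical, and this summary does not state how the paper balances the two losses.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability list."""
    return -math.log(probs[target])


def mean_loss(prob_rows, targets):
    """Average cross-entropy over a sequence of predictions."""
    if not targets or len(prob_rows) != len(targets):
        raise ValueError("need one non-empty prediction row per target")
    return sum(cross_entropy(p, t) for p, t in zip(prob_rows, targets)) / len(targets)


def joint_loss(lemma_char_probs, lemma_targets, lm_word_probs, lm_targets, lm_weight=0.5):
    """Weighted sum of a lemma-decoder loss and a language-model loss.

    lm_weight is an assumed hyperparameter for illustration only.
    """
    lemma_loss = mean_loss(lemma_char_probs, lemma_targets)  # character-level decoder
    lm_loss = mean_loss(lm_word_probs, lm_targets)           # word-level LM on the sentence encoder
    return lemma_loss + lm_weight * lm_loss
```

Setting `lm_weight=0` in this sketch reduces the objective to plain lemmatization, which is one way to compare against a model trained without the auxiliary language-modeling signal.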

## Key findings

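- Training the sentence encoder jointly for lemmatization and language modeling yields significant improvements over the state of the art on non-standard historical languages.
- On a set of typologically diverse standard languages, results are on par with or better than both a model without enhanced sentence representations and previous state-of-the-art systems.
- The model does not require POS or morphological annotations, which are not always available for historical corpora.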

## Abstract

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (iii): spelling variation due to lacking orthographic standards. We approach lemmatization as a string-transduction task with an encoder-decoder architecture which we enrich with sentence context information using a hierarchical sentence encoder. We show significant improvements over the state-of-the-art when training the sentence encoder jointly for lemmatization and language modeling. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. Additionally, we also test the proposed model on a set of typologically diverse standard languages showing results on par or better than a model without enhanced sentence representations and previous state-of-the-art systems. Finally, to encourage future work on processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, based on openly accessible sources.
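To make the string-transduction framing concrete, the following sketch shows one way to turn an annotated sentence into training examples: each token is a character sequence to be transduced into its lemma's character sequence, with the full sentence kept as context. The `LemmaExample` class, the `make_examples` helper, and the sample sentence are hypothetical and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LemmaExample:
    source_chars: List[str]  # characters of the inflected token (encoder input)
    target_chars: List[str]  # characters of the lemma (decoder output)
    context: List[str]       # all tokens of the sentence, for a sentence encoder
    position: int            # index of the token within the sentence


def make_examples(tokens: List[str], lemmas: List[str]) -> List[LemmaExample]:
    """Build one character-level transduction example per token."""
    if len(tokens) != len(lemmas):
        raise ValueError("tokens and lemmas must be aligned one-to-one")
    return [
        LemmaExample(list(tok), list(lem), list(tokens), i)
        for i, (tok, lem) in enumerate(zip(tokens, lemmas))
    ]


# Illustrative, made-up sentence (not from the released dataset).
examples = make_examples(
    ["the", "cats", "were", "sleeping"],
    ["the", "cat", "be", "sleep"],
)
```

Keeping the sentence alongside each token is what lets a context-aware model choose between competing lemmas for an ambiguous form, and no POS or morphological tags appear anywhere in the example structure.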

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.06939/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1903.06939/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/1903.06939/full.md

---
Source: https://tomesphere.com/paper/1903.06939