A Simple Joint Model for Improved Contextual Neural Lemmatization

Chaitanya Malaviya; Shijie Wu; Ryan Cotterell

arXiv:1904.02306·cs.CL·May 29, 2024

TL;DR

This paper introduces a simple joint neural model for lemmatization and morphological tagging that improves accuracy across 20 languages, especially in low-resource and morphologically complex languages.

Contribution

A novel joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on multiple languages.

Findings

01

Joint modeling improves lemmatization accuracy in low-resource languages.

02

The model performs well on morphologically complex languages.

03

Code and models are publicly available.

Abstract

English verbs have multiple forms. For instance, talk may also appear as talks, talked or talking, depending on the context. The NLP task of lemmatization seeks to map these diverse forms back to a canonical one, known as the lemma. We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages from the Universal Dependencies corpora. Our paper describes the model in addition to training and decoding procedures. Error analysis indicates that joint morphological tagging and lemmatization is especially helpful in low-resource lemmatization and languages that display a larger degree of morphological complexity. Code and pre-trained models are available at https://sigmorphon.github.io/sharedtasks/2019/task2/.

Tables

Table 2: Correlations between two aspects of the data (number of tokens and number of tags) and $\Delta$, the difference in dev performance between our model with greedy decoding and Lematus.

	Pearson’s $R$	Spearman’s $\rho$
# tags vs. $\Delta$	0.206	0.209
# tokens vs. $\Delta$	-0.808	-0.845

Table 5: Morphological tagging performance on the development set.

Language	F1 Score
Arabic	85.62
Basque	83.68
Croatian	85.37
Dutch	90.92
Estonian	65.80
Finnish	87.94
German	79.45
Greek	87.63
Hindi	87.89
Hungarian	86.00
Italian	93.78
Latvian	80.96
Polish	80.29
Portuguese	93.65
Romanian	93.51
Russian	83.69
Slovak	64.53
Slovenian	88.81
Turkish	82.60
Urdu	72.86
AVERAGE	83.75

Equations

$$p(\boldsymbol{\ell},\mathbf{m}\mid\mathbf{w})=\left(\prod_{i=1}^{n}\underbrace{p(\ell_{i}\mid m_{i},w_{i})}_{\text{Neural Transducer}}\right)\underbrace{p(\mathbf{m}\mid\mathbf{w})}_{\text{Neural Tagger}}\tag{1}$$

$$\mathbf{u}_{i}=[\text{cLSTM}(c_{1}\ldots c_{n});\ \text{cLSTM}(c_{n}\ldots c_{1})]$$

$$\begin{aligned}
p(\ell\mid m,w)&=\sum_{\mathbf{a}\in A}p(\ell,\mathbf{a}\mid m,w)\\
&=\sum_{\mathbf{a}\in A}\prod_{j=1}^{|\ell|}p(\ell_{j}\mid a_{j},\ell_{<j},m,w)\times p(a_{j}\mid a_{j-1},\ell_{<j},m,w)\\
&=\sum_{\mathbf{a}\in A}\prod_{j=1}^{|\ell|}p(\ell_{j}\mid\mathbf{h}^{\textit{(enc)}}_{a_{j}},\mathbf{h}^{\textit{(dec)}}_{j})\times p(a_{j}\mid a_{j-1},\mathbf{h}^{\textit{(enc)}},\mathbf{h}^{\textit{(dec)}}_{j})
\end{aligned}$$

$$\mathbf{m}^{\star}=\operatorname{argmax}_{\mathbf{m}}\log p(\mathbf{m}\mid\mathbf{w})$$

$$\ell_{i}^{\star}=\operatorname{argmax}_{\ell}\log p(\ell\mid m_{i}^{\star},w_{i})$$

$$\ell_{i}^{\star}=\operatorname{argmax}_{\ell}\log\sum_{m_{i}\in\mathcal{K}(w_{i})}p(\ell\mid m_{i},w_{i})\,p(m_{i}\mid\mathbf{w})$$

Full text

Chaitanya Malaviya*,β

Shijie Wu*,ə

Ryan Cotterell ə,ɦ

β Allen Institute for Artificial Intelligence

ə Department of Computer Science, Johns Hopkins University

ɦ Department of Computer Science and Technology, University of Cambridge

1 Introduction

* Equal contribution. Listing order is random.

Lemmatization is a core NLP task that involves a string-to-string transduction from an inflected word form to its citation form, known as the lemma. More concretely, consider the English sentence: The bulls are running in Pamplona. A lemmatizer will seek to map each word to a form you may find in a dictionary—for instance, mapping running to run. This linguistic normalization is important in several downstream NLP applications, especially for highly inflected languages. Lemmatization has previously been shown to improve recall for information retrieval Kanis and Skorkovská (2010); Monz and De Rijke (2001), to aid machine translation Fraser et al. (2012); Chahuneau et al. (2013) and is a core part of modern parsing systems Björkelund et al. (2010); Zeman et al. (2018).

However, the task is quite nuanced as the proper choice of the lemma is context dependent. For instance, in the sentence A running of the bulls took place in Pamplona, the word running is its own lemma, since, here, running is a noun rather than an inflected verb. Several counter-examples exist to this trend, as discussed in depth in Haspelmath and Sims (2013). Thus, a good lemmatizer must make use of some representation of each word’s sentential context. The research question in this work is, then, how do we design a lemmatization model that best extracts the morpho-syntax from the sentential context?

Recent work Bergmanis and Goldwater (2018) has presented a system that directly summarizes the sentential context using a recurrent neural network to decide how to lemmatize. As Bergmanis and Goldwater (2018)’s system currently achieves state-of-the-art results, it must implicitly learn a contextual representation that encodes the necessary morpho-syntax, as such knowledge is requisite for the task. We contend, however, that rather than expecting the network to implicitly learn some notion of morpho-syntax, it is better to explicitly train a joint model to morphologically disambiguate and lemmatize. Indeed, to this end, we introduce a joint model for the introduction of morphology into a neural lemmatizer. A key feature of our model is its simplicity: Our contribution is to show how to stitch existing models together into a joint model, explaining how to train and decode the model. However, despite the model’s simplicity, it still achieves a significant improvement over the state of the art on our target task: lemmatization.

Experimentally, our contributions are threefold. First, we show that our joint model achieves state-of-the-art results, outperforming (on average) all competing approaches on a 20-language subset of the Universal Dependencies (UD) corpora Nivre et al. (2017). Second, by providing the joint model with gold morphological tags, we demonstrate that we are far from achieving the upper bound on performance—improvements on morphological tagging could lead to substantially better lemmatization. Finally, we provide a detailed error analysis indicating when and why morphological analysis helps lemmatization. We offer two tangible recommendations: one is better off using a joint model (i) for languages with fewer training data available and (ii) languages that have richer morphology.

Our system and pre-trained models on all languages in the latest version of the UD corpora are released at https://sigmorphon.github.io/sharedtasks/2019/task2/. (Footnote: We compare to previously published numbers on non-recent versions of UD, but the models we release are trained on the current version (2.3).) (Footnote: We use the UniMorph schema Sylak-Glassman (2016) for morphological attributes instead of the UD schema. Note that the mapping from the UD schema to the UniMorph schema is not one-to-one McCarthy et al. (2018).)

2 Background: Lemmatization

Most languages Dryer and Haspelmath (2013) in the world exhibit a linguistic phenomenon known as inflectional morphology, which causes word forms to mutate according to the syntactic category of the word. The syntactic context in which the word form occurs determines which form is properly used. One privileged form in the set of inflections is called the lemma. We regard the lemma as a lexicographic convention, often used to better organize dictionaries. Thus, the choice of which inflected form is the lemma is motivated by tradition and convenience, e.g., the lemma is the infinitive for verbs in some Indo-European languages, rather than by linguistic or cognitive concerns. Note that the stem differs from the lemma in that the stem may not be an actual inflection. (Footnote: The stem is also often ill-defined. What is, for instance, the stem of the word running: is it run or runn?) In the NLP literature, the syntactic category that each inflected form encodes is called the morphological tag. The morphological tag generalizes traditional part-of-speech tags, enriching them with further linguistic knowledge such as tense, mood, and grammatical case. We call the individual key–attribute pairs morphological attributes. An example of a sentence annotated with morphological tags and lemmata in context is given in Figure 2. The task of mapping a sentence to a sequence of morphological tags is known as morphological tagging.

Notation.

Let $\mathbf{w}=w_{1},\ldots,w_{n}$ be a sequence of $n$ words. Each individual word is denoted as $w_{i}$. Likewise, let $\mathbf{m}=m_{1},\ldots,m_{n}$ and ${\boldsymbol{\ell}}=\ell_{1},\ldots,\ell_{n}$ be sequences of morphological tags and lemmata, respectively. We will denote the set of all tags seen in a treebank as $\mathcal{Y}$. We remark that $m_{i}$ is $w_{i}$’s morphological tag (e.g. [pos=n, case=nom, num=sg] as a single label) and $\ell_{i}$ is $w_{i}$’s lemma. We will denote a language’s discrete alphabet of characters as $\Sigma$. Thus, we have $w_{i},\ell_{i}\in\Sigma^{*}$. Furthermore, let $\mathbf{c}=c_{1},\ldots,c_{n}$ be a vector of characters where $c_{i}\in\Sigma$.

3 A Joint Neural Model

The primary contribution of this paper is a joint model of morphological tagging and lemmatization. The intuition behind the joint model is simple: high-accuracy lemmatization requires a representation of the sentential context in which the word occurs (as argued in § 1), and a morphological tag provides the precise summary of the context required to choose the correct lemma. Armed with this, we define our joint model of lemmatization and morphological tagging as:

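$$p(\boldsymbol{\ell},\mathbf{m}\mid\mathbf{w})=\left(\prod_{i=1}^{n}\underbrace{p(\ell_{i}\mid m_{i},w_{i})}_{\text{Neural Transducer}}\right)\underbrace{p(\mathbf{m}\mid\mathbf{w})}_{\text{Neural Tagger}}\tag{1}$$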

Figure 1 illustrates the structure of our model in the form of a graphical model. We discuss the lemmatization factor and the morphological tagging factor separately in the following two subsections. We caution the reader that the discussion of these models will be brief: neither of these particular components is novel with respect to the literature, so the formal details of the two models are best found in the original papers. The point of our paper is to describe a simple manner to combine these existing parts into a state-of-the-art lemmatizer.
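As a small worked illustration of the factorization in eq. 1, the sketch below scores one sentence by adding the tagger's log-probability to the per-word lemmatizer log-probabilities. The function name `joint_log_prob` and all probabilities are made up for illustration; they are not values from the paper.

```python
import math

def joint_log_prob(tag_log_prob, lemma_probs):
    """log p(l, m | w) = log p(m | w) + sum_i log p(l_i | m_i, w_i)."""
    return tag_log_prob + sum(math.log(p) for p in lemma_probs)

# Hypothetical factor values for a four-word sentence.
tag_prob = 0.8                          # p(m | w) from the tagger
lemma_probs = [0.99, 0.95, 0.90, 0.85]  # p(l_i | m_i, w_i) for each word

log_joint = joint_log_prob(math.log(tag_prob), lemma_probs)
print(math.exp(log_joint))
```

Working in log space keeps the product over words numerically stable for long sentences.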

3.1 Morphological Tagger: $p(\mathbf{m}\mid\mathbf{w})$

We employ a simple LSTM-based tagger to recover the morphological tags of a sentence Heigold et al. (2017); Cotterell and Heigold (2017). We also experimented with the neural conditional random field of Malaviya et al. (2018), but Heigold et al. (2017) gave slightly better tagging scores on average and is faster to train. Given a sequence of $n$ words $\mathbf{w}=w_{1},\ldots,w_{n}$ , we would like to obtain the morphological tags $\mathbf{m}=m_{1},\ldots,m_{n}$ for each word, where $m_{i}\in\mathcal{Y}$ . The model first obtains a word representation for each token using a character-level biLSTM Graves et al. (2013) embedder, which is then input to a word-level biLSTM tagger that predicts tags for each word. Given a function cLSTM that returns the last hidden state of a character-based LSTM, first we obtain a word representation $\mathbf{u}_{i}$ for word $w_{i}$ as,

$$\mathbf{u}_{i}=[\text{cLSTM}(c_{1}\ldots c_{n});\ \text{cLSTM}(c_{n}\ldots c_{1})]$$

where $c_{1},\ldots,c_{n}$ is the character sequence of the word. This representation $\mathbf{u}_{i}$ is then input to a word-level biLSTM tagger. The word-level biLSTM tagger predicts a tag from $\mathcal{Y}$. A full description of the model is found in Heigold et al. (2017). We use standard cross-entropy loss for training this model and decode greedily while predicting the tags at test time. Note that greedy decoding is optimal in this tagger as there is no interdependence between the tags $m_{i}$.

3.2 A Lemmatizer: $p(\ell_{i}\mid m_{i},w_{i})$

Neural sequence-to-sequence models Sutskever et al. (2014); Bahdanau et al. (2015) have yielded state-of-the-art performance on the task of generating morphological variants, including the lemma, as evinced in several recent shared tasks on the subject Cotterell et al. (2016, 2017, 2018). Our lemmatization factor in eq. 1 is based on such models. Specifically, we make use of a hard-attention mechanism Xu et al. (2015); Rastogi et al. (2016), rather than the original soft-attention mechanism. Our choice of hard attention is motivated by the performance of Makarov and Clematide (2018)’s system at the CoNLL-SIGMORPHON task. We use a nearly identical model, but opt for an exact dynamic-programming-based inference scheme Wu et al. (2018). (Footnote: Our formulation differs from the work of Wu et al. (2018) in that we enforce monotonic hard alignments, rather than allow for non-monotonic alignments.)

We briefly describe the model here. Given an inflected word $w$ and a tag $m$ , we would like to obtain the lemma $\ell\in\Sigma^{*}$ , dropping the subscript for simplicity. Moreover, for the remainder of this section the subscripts will index into the character string $\ell$ , that is $\ell=\ell_{1},\ldots,\ell_{|\ell|}$ , where each $\ell_{i}\in\Sigma$ . A character-level biLSTM encoder embeds $w$ to $\mathbf{h}^{\textit{(enc)}}$ . The decoder LSTM produces $\mathbf{h}^{\textit{(dec)}}_{j}$ , reading the concatenation of the embedding of the previous character $\ell_{j-1}\in\Sigma$ and the tag embedding $\mathbf{h}^{\textit{(tag)}}$ , which is produced by an order-invariant linear function. In contrast to soft attention, hard attention models the alignment distribution explicitly.

We denote $A=\{a_{1},\ldots,a_{|w|}\}^{|\ell|}$ as the set of all non-monotonic alignments from $w$ to $\ell$ where an alignment aligns each target character $\ell_{j}$ to exactly one source character in $w$ and for $\mathbf{a}\in A$ , $a_{j}=i$ refers to the event that the $j^{\text{th}}$ character of $\ell$ is aligned to the $i^{\text{th}}$ character of $w$ . We factor the probabilistic lemmatizer:

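$$\begin{aligned}
p(\ell\mid m,w)&=\sum_{\mathbf{a}\in A}p(\ell,\mathbf{a}\mid m,w)\\
&=\sum_{\mathbf{a}\in A}\prod_{j=1}^{|\ell|}p(\ell_{j}\mid a_{j},\ell_{<j},m,w)\times p(a_{j}\mid a_{j-1},\ell_{<j},m,w)\\
&=\sum_{\mathbf{a}\in A}\prod_{j=1}^{|\ell|}p(\ell_{j}\mid\mathbf{h}^{\textit{(enc)}}_{a_{j}},\mathbf{h}^{\textit{(dec)}}_{j})\times p(a_{j}\mid a_{j-1},\mathbf{h}^{\textit{(enc)}},\mathbf{h}^{\textit{(dec)}}_{j})
\end{aligned}$$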

The summation is computed with dynamic programming—specifically, using the forward algorithm for hidden Markov models Rabiner (1989). $p(\ell_{j}\mid\mathbf{h}^{\textit{(enc)}}_{a_{j}},\mathbf{h}^{\textit{(dec)}}_{j})$ is a two-layer feed-forward network followed by a softmax. The transition $p(a_{j}\mid a_{j-1},\mathbf{h}^{\textit{(enc)}},\mathbf{h}^{\textit{(dec)}}_{j})$ is the multiplicative attention function with $\mathbf{h}^{\textit{(enc)}}$ and $\mathbf{h}^{\textit{(dec)}}_{j}$ as input. To enforce monotonicity, $p(a_{j}\mid a_{j-1})=0$ if $a_{j}<a_{j-1}$ . The exact details of the lemmatizer are given by Wu and Cotterell (2019).
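To make the dynamic program concrete, here is a minimal sketch of the forward algorithm over alignments, checked against brute-force enumeration. It assumes the emission and transition probabilities have already been computed by the networks and are given as plain tables; the table layout (`init`, `trans`, `emit`) and all numbers are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

def forward_likelihood(init, trans, emit):
    """Sum over alignments of prod_j p(l_j | a_j) * p(a_j | a_{j-1}).

    init[i]        : p(a_1 = i)
    trans[j][k][i] : p(a_j = i | a_{j-1} = k), used for target positions j >= 1
    emit[j][i]     : p(l_j | source position i, decoder state j)
    """
    n_src = len(init)
    alpha = [init[i] * emit[0][i] for i in range(n_src)]
    for j in range(1, len(emit)):
        alpha = [
            emit[j][i] * sum(alpha[k] * trans[j][k][i] for k in range(n_src))
            for i in range(n_src)
        ]
    return sum(alpha)

def brute_force_likelihood(init, trans, emit):
    """Same quantity, computed by enumerating every alignment explicitly."""
    n_src, n_tgt = len(init), len(emit)
    total = 0.0
    for a in product(range(n_src), repeat=n_tgt):
        p = init[a[0]] * emit[0][a[0]]
        for j in range(1, n_tgt):
            p *= trans[j][a[j - 1]][a[j]] * emit[j][a[j]]
        total += p
    return total

# Toy tables: 2 source characters, 3 target characters.
# Monotonicity: moving from source position 1 back to 0 has probability 0.
step = [[0.6, 0.4], [0.0, 1.0]]
init = [0.7, 0.3]
trans = [None, step, step]          # trans[0] is unused
emit = [[0.9, 0.2], [0.5, 0.5], [0.1, 0.8]]

print(forward_likelihood(init, trans, emit))
print(brute_force_likelihood(init, trans, emit))
```

The forward pass costs on the order of $|\ell|\cdot|w|^{2}$ operations, whereas enumeration grows as $|w|^{|\ell|}$.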

3.3 Decoding

We consider two ways of decoding our model. The first is a greedy decoding scheme. The second is a crunching scheme May and Knight (2006). We describe each in turn.

Greedy Decoding.

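In the greedy scheme, we select the best morphological tag sequence

$$\mathbf{m}^{\star}=\operatorname{argmax}_{\mathbf{m}}\log p(\mathbf{m}\mid\mathbf{w})$$

and then decode each lemma

$$\ell_{i}^{\star}=\operatorname{argmax}_{\ell}\log p(\ell\mid m_{i}^{\star},w_{i})$$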

Note that we slightly abuse notation since the argmax here is approximate: exact decoding of our neural lemmatizer is hard. This sort of scheme is also referred to as pipeline decoding.

Crunching.

In the crunching scheme, we first extract a $k$ -best list of taggings from the morphological tagger. For an input sentence $\mathbf{w}$ , call the $k$ -best tags for the $i^{\text{th}}$ word ${\cal K}(w_{i})$ . Crunching then says we should decode in the following manner

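$$\ell_{i}^{\star}=\operatorname{argmax}_{\ell}\log\sum_{m_{i}\in\mathcal{K}(w_{i})}p(\ell\mid m_{i},w_{i})\,p(m_{i}\mid\mathbf{w})$$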

Crunching is a tractable heuristic that approximates true joint decoding and, as such, we expect it to outperform the more naïve greedy approach. (Footnote: True joint decoding would sum over all possible morphological tags, rather than just the $k$-best. While this is tractable in our setting in the sense that there are at most $1662$ morphological tags (in the case of Basque), it is significantly slower than using a smaller $k$. Indeed, the probability distributions that morphological taggers learn tend to be peaked to the point that considering improbable tags is not necessary.)
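The difference between the two decoding schemes can be seen on a toy example. The sketch below assumes that the tagger and lemmatizer distributions for a single word are available as small dictionaries, which is a simplification since the real lemmatizer scores strings rather than a fixed candidate list. The tag labels and probabilities are invented for illustration.

```python
def greedy_decode(tag_probs, lemma_probs):
    """Pick the single best tag, then the best lemma given that tag."""
    best_tag = max(tag_probs, key=tag_probs.get)
    candidates = lemma_probs[best_tag]
    return max(candidates, key=candidates.get)

def crunch_decode(tag_probs, lemma_probs, k):
    """Score each lemma by summing p(lemma | tag) * p(tag) over the k best tags."""
    k_best = sorted(tag_probs, key=tag_probs.get, reverse=True)[:k]
    scores = {}
    for tag in k_best:
        for lemma, p in lemma_probs[tag].items():
            scores[lemma] = scores.get(lemma, 0.0) + p * tag_probs[tag]
    return max(scores, key=scores.get)

# Hypothetical distributions for the word "running".
tag_probs = {"V;PTCP": 0.40, "N;SG": 0.35, "ADJ": 0.25}
lemma_probs = {
    "V;PTCP": {"run": 0.90, "running": 0.10},
    "N;SG": {"run": 0.05, "running": 0.95},
    "ADJ": {"run": 0.10, "running": 0.90},
}

print(greedy_decode(tag_probs, lemma_probs))
print(crunch_decode(tag_probs, lemma_probs, k=3))
```

In this toy case the most probable tag is only narrowly ahead, so the two schemes disagree: the lemma favored by the lower-ranked tags accumulates more total probability once they are summed. With `k=1`, crunching reduces to greedy decoding.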

3.4 Training with Jackknifing

In our model, a simple application of maximum-likelihood estimation (MLE) is unlikely to work well. The reason is that our model is a discriminative directed graphical model (as seen in Figure 1) and, thus, suffers from exposure bias Ranzato et al. (2015). The intuition behind the poor performance of MLE is simple: the output of the lemmatizer depends on the output of the morphological tagger; as the lemmatizer has only ever seen correct morphological tags, it has never learned to adjust for the errors that will be made at the time of decoding. To compensate for this, we employ jackknifing Agić and Schluter (2017), which is standard practice in many NLP pipelines, such as dependency parsing.

Jackknifing for training NLP pipelines is quite similar to the oft-employed cross-validation. We divide our training data into $\kappa$ splits. Then, for each split $i\in\{1,\ldots,\kappa\}$ , we train the morphological tagger on all but the $i^{\text{th}}$ split and then decode the trained tagger, using either greedy decoding or crunching, to get silver-standard tags for the held-out $i^{\text{th}}$ split. Finally, we take our collection of silver-standard morphological taggings and use those as input to the lemmatizer in order to train it. This technique helps avoid exposure bias and improves the lemmatization performance, which we will demonstrate empirically in § 4. Indeed, the model is quite ineffective without this training regime. Note that we employ jackknifing for both the greedy decoding scheme and the crunching decoding scheme.
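The jackknifing procedure can be sketched as follows. The tagger here is a toy most-frequent-tag lookup standing in for the neural tagger, and the function names, the round-robin split, and the data are all illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_tagger(sentences):
    """Toy stand-in for the neural tagger: most frequent tag per word form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda words: [table.get(w, "UNK") for w in words]

def jackknife(sentences, n_splits):
    """Silver tags for each sentence, predicted by a tagger that never saw it."""
    silver = [None] * len(sentences)
    for i in range(n_splits):
        train = [s for idx, s in enumerate(sentences) if idx % n_splits != i]
        tagger = train_tagger(train)
        for idx in range(i, len(sentences), n_splits):
            words = [w for w, _ in sentences[idx]]
            silver[idx] = tagger(words)
    return silver

# Tiny hypothetical treebank of (word, gold tag) pairs.
data = [
    [("the", "DET"), ("bulls", "N")],
    [("the", "DET"), ("run", "V")],
    [("a", "DET"), ("run", "N")],
    [("the", "DET"), ("bulls", "N")],
]
silver_tags = jackknife(data, n_splits=2)
for sentence, tags in zip(data, silver_tags):
    print([w for w, _ in sentence], tags)
```

The silver tags contain realistic tagger errors, so a lemmatizer trained on them sees noisy inputs of the kind it will meet at decoding time.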

4 Experimental Setup

4.1 Dataset

To enable a fair comparison with Bergmanis and Goldwater (2018), we use the Universal Dependencies Treebanks Nivre et al. (2017) for all our experiments. Following previous work, we use v2.0 of the treebanks for all languages, except Dutch, for which v2.1 was used due to inconsistencies in v2.0. The standard splits are used for all treebanks.

4.2 Training Setup and Hyperparameters

For the morphological tagger, we use the baseline implementation from Malaviya et al. (2018). This implementation uses an input layer and linear layer dimension of 128 and a 2-layer LSTM with a hidden layer dimension of 256. The Adam (Kingma and Ba, 2015) optimizer is used for training and a dropout rate Srivastava et al. (2014) of 0.3 is enforced during training. The tagger was trained for 10 epochs.

For the lemmatizer, we use a 2-layer biLSTM encoder and a 1-layer LSTM decoder with 400 hidden units. The dimensions of character and tag embedding are 200 and 40, respectively. We enforce a dropout rate of 0.4 in the embedding and encoder LSTM layers. The lemmatizer is also trained with Adam and the learning rate is 0.001. We halve the learning rate whenever the development log-likelihood decreases and we perform early-stopping when the learning rate reaches $1\times 10^{-5}$. We apply gradient clipping with a maximum gradient norm of 5.
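A rough sketch of the learning-rate schedule follows. It assumes the rate is halved whenever the development log-likelihood fails to improve on its best value so far, and that training stops once the rate is at or below $1\times 10^{-5}$. Both readings are interpretations, and `run_schedule` is a hypothetical helper rather than part of the released code.

```python
def run_schedule(dev_log_likelihoods, lr=1e-3, min_lr=1e-5):
    """Return the learning rate after each epoch until early stopping."""
    best = float("-inf")
    history = []
    for log_likelihood in dev_log_likelihoods:
        if log_likelihood > best:
            best = log_likelihood   # improvement: keep the current rate
        else:
            lr /= 2                 # no improvement: halve the rate
        history.append(lr)
        if lr <= min_lr:            # early stopping
            break
    return history

# Hypothetical dev scores: one good epoch, then no further improvement.
history = run_schedule([-1.0] + [-2.0] * 10)
print(len(history), history[-1])
```

Starting from 0.001, six halvings give about $1.56\times 10^{-5}$ and the seventh gives about $7.8\times 10^{-6}$, so under these assumptions training stops after seven non-improving epochs.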

4.3 Baselines (and Related Work)

Previous work on lemmatization has investigated both neural Bergmanis and Goldwater (2019) and non-neural Chrupała (2008); Müller et al. (2015); Nicolai and Kondrak (2016); Cotterell et al. (2017) methods. We compare our approach against competing methods that report results on UD datasets.

Lematus.

The current state of the art is held by Bergmanis and Goldwater (2018), who, as discussed in § 1, provide a direct context-to-lemma approach, avoiding the use of morphological tags. We remark that Bergmanis and Goldwater (2018) assume a setting where lemmata are annotated at the token level, but morphological tags are not available; we contend, however, that such a setting is not entirely realistic as almost all corpora annotated with lemmata at the token level include morpho-syntactic annotation, including the vast majority of the UD corpora. Thus, we do not consider it a stretch to assume the annotation of morphological tags to train our joint model. (Footnote: After correspondence with Toms Bergmanis, we would like to clarify this point. While Bergmanis and Goldwater (2018) explores the model in a token-annotated setting, as do we, the authors argue that such a model is better for a very low-resource scenario where the entire sentence is not annotated for lemmata. We concede this point: our current model is not applicable in such a setting. However, we note that a semi-supervised morphological tagger could be trained in such a situation as well, which may benefit lemmatization.)

UDPipe.

Our next baseline is the UDPipe system of Straka and Straková (2017). Their system performs lemmatization using an averaged perceptron tagger that predicts a (lemma rule, UPOS) pair. Here, a lemma rule generates a lemma by removing parts of the word prefix/suffix and prepending and appending a new prefix/suffix. A guesser first produces correct lemma rules and the tagger is used to disambiguate from them.
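The idea of a lemma rule can be illustrated with a few lines of code. The encoding below (counts of characters to strip plus strings to add) is a hypothetical simplification chosen for clarity and is not UDPipe's actual rule format.

```python
def apply_lemma_rule(form, strip_prefix, strip_suffix, add_prefix, add_suffix):
    """Remove characters from both ends of the form, then attach new affixes."""
    core = form[strip_prefix:len(form) - strip_suffix]
    return add_prefix + core + add_suffix

# Illustrative rules: (form, strip_prefix, strip_suffix, add_prefix, add_suffix)
examples = [
    ("running", 0, 4, "", ""),    # drop "ning"
    ("talked", 0, 2, "", ""),     # drop "ed"
    ("gesagt", 2, 1, "", "en"),   # drop "ge" and "t", append "en"
]
for form, *rule in examples:
    print(form, "->", apply_lemma_rule(form, *rule))
```

Because many word forms share the same rule, predicting a rule turns lemmatization into a classification problem over a modest set of labels.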

Lemming.

The strongest non-neural baseline we consider is the system of Müller et al. (2015), who, like us, develop a joint model of morphological tagging and lemmatization. In contrast to us, however, their model is globally normalized Lafferty et al. (2001). Due to their global normalization, they directly estimate the parameters of their model with MLE without worrying about exposure bias. However, in order to efficiently normalize the model, they heuristically limit the set of possible lemmata through the use of edit trees Chrupała (2008), which makes the computation of the partition function tractable.

Morfette.

Much like Müller et al. (2015), Morfette relies on the concept of edit trees. However, a simple perceptron is used for classification with hand-crafted features. A full description of the model is given in Chrupala et al. (2008).

5 Results and Discussion

Experimentally, we aim to show three points. i) Our joint model (eq. 1) of morphological tagging and lemmatization achieves state-of-the-art accuracy; this builds on the findings of Bergmanis and Goldwater (2018), who show that context significantly helps neural lemmatization. Moreover, the upper bound for contextual lemmatizers that make use of morphological tags is much higher, indicating room for improved lemmatization with better morphological taggers. ii) We discuss a number of error patterns that the model seems to make on the languages, where absolute accuracy is lowest: Latvian, Estonian and Arabic. We suggest possible paths forward to improve performance. iii) We offer an explanation for when our joint model does better than the context-to-lemma baseline. We show in a correlational study that our joint approach with morphological tagging helps the most with low-resource and morphologically rich languages.

5.1 Main Results

The first experiment we run focuses on pure performance of the model. Our goal is to determine whether joint morphological tagging and lemmatization improves average performance in a state-of-the-art neural model.

Evaluation Metrics.

For measuring lemmatization performance, we measure the accuracy of guessing the lemmata correctly over an entire corpus. To demonstrate the effectiveness of our model in utilizing context and generalizing to unseen word forms, we follow Bergmanis and Goldwater (2018) and also report accuracies on tokens that are i) ambiguous, i.e., more than one lemma exists for the same inflected form, ii) unseen, i.e., where the inflected form has not been seen in the training set, and iii) seen unambiguous, i.e., where the inflected form has only one lemma and is seen in the training set.
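The three token categories can be computed with a simple lookup over the training data. The sketch below assumes that ambiguity is judged by the set of lemmata observed for a form in the training set; the function names and the toy data are illustrative.

```python
from collections import defaultdict

def build_lexicon(train_pairs):
    """Map each inflected form to the set of lemmata seen with it in training."""
    lexicon = defaultdict(set)
    for form, lemma in train_pairs:
        lexicon[form].add(lemma)
    return lexicon

def categorize(form, lexicon):
    if form not in lexicon:
        return "unseen"
    return "ambiguous" if len(lexicon[form]) > 1 else "seen_unambiguous"

def accuracy_by_category(train_pairs, test_triples):
    """test_triples holds (form, gold lemma, predicted lemma) per token."""
    lexicon = build_lexicon(train_pairs)
    hits, totals = defaultdict(int), defaultdict(int)
    for form, gold, predicted in test_triples:
        for key in ("all", categorize(form, lexicon)):
            totals[key] += 1
            hits[key] += int(gold == predicted)
    return {key: hits[key] / totals[key] for key in totals}

# Hypothetical data.
train = [("running", "run"), ("running", "running"), ("bulls", "bull")]
test = [
    ("running", "run", "run"),
    ("running", "running", "run"),
    ("bulls", "bull", "bull"),
    ("talked", "talk", "talk"),
]
report = accuracy_by_category(train, test)
print(report)
```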

Results.

The results showing comparisons with all other methods are summarized in Figure 3. Additional results are presented in App. A. Each bar represents the average accuracy across 20 languages. Our method achieves an average accuracy of $95.42$ and the strongest baseline, Bergmanis and Goldwater (2018), achieves an average accuracy of $95.05$. The difference in performance ($0.37$) is statistically significant with $p<0.01$ under a paired permutation test. We outperform the strongest baseline in 11 out of 20 languages and underperform in only 3 languages with $p<0.05$. The difference between our method and all other baselines is statistically significant with $p<0.001$ in all cases. We highlight two additional features of the data. First, decoding using gold morphological tags gives an accuracy of $98.04$ for a difference in performance of $+2.62$. We take the large difference between the upper bound and the current performance of our model to indicate that improved morphological tagging is likely to significantly help lemmatization. Second, it is noteworthy that training with gold tags, but decoding with predicted tags, yields performance that is significantly worse than every baseline except for UDPipe. This speaks for the importance of jackknifing in the training of joint morphological tagger-lemmatizers that are directed and, therefore, suffer from exposure bias.
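For readers unfamiliar with the paired permutation test, a minimal Monte Carlo version is sketched below. The text does not say which units are paired or how many resamples were used, so the sketch assumes per-item scores for two systems and a sign-flipping scheme; the function name and parameters are illustrative.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided p-value for the mean difference between paired scores.

    Under the null hypothesis the two systems are exchangeable, so the sign
    of each paired difference can be flipped at random.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_resamples):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical per-item correctness (1 = correct) for two systems.
system_a = [1] * 20
system_b = [0] * 20
print(paired_permutation_test(system_a, system_b))
print(paired_permutation_test(system_a, system_a))
```

When the two systems produce identical scores every resample ties with the observed statistic, so the estimated p-value is exactly 1.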

In Figure 4, we observe that crunching further improves on the performance of the greedy decoding scheme. In 8 out of 20 languages, the improvement is statistically significant with $p<0.05$. We select the best $k$ for each language based on the development set.

In Figure 5, we provide a language-wise breakdown of the performance of our model and the model of Bergmanis and Goldwater (2018). Our strongest improvements are seen in Latvian, Greek and Hungarian. When measuring performance solely over unseen inflected forms, we achieve even stronger gains over the baseline method in most languages. This demonstrates the generalization power of our model beyond word forms seen in the training set. In addition, our accuracies on ambiguous tokens are also seen to be higher than the baseline on average, with strong improvements on highly inflected languages such as Latvian and Russian. Finally, on seen unambiguous tokens, we note improvements that are similar across all languages.

5.2 Error Patterns

We attempt to identify systematic error patterns of our model in an effort to motivate future work. For this analysis, we compare predictions of our model and the gold lemmata on three languages with the weakest absolute performance: Estonian, Latvian and Arabic. First, we note the differences in the average lengths of gold lemmata in the tokens we guess incorrectly and all the tokens in the corpus. The lemmata we guess incorrectly are on average 1.04 characters longer than the average length of all the lemmata in the corpus. We found that the length of the incorrect lemmata does not correlate strongly with their frequency. Next, we identify the most common set of edit operations in each language that would transform the incorrect hypothesis to the gold lemma. This set of edit operations was found to follow a power-law distribution.
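One way to extract such edit operations is sketched below using Python's standard `difflib`. This is an assumption about the procedure, not the authors' script: `SequenceMatcher` finds matching blocks rather than a guaranteed minimum edit distance, and it groups adjacent characters into one operation. The word pairs are illustrative strings.

```python
from collections import Counter
from difflib import SequenceMatcher

def edit_ops(hypothesis, gold):
    """Operations that turn the predicted lemma into the gold lemma."""
    matcher = SequenceMatcher(None, hypothesis, gold, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            ops.append(f"replace: {hypothesis[i1:i2]} -> {gold[j1:j2]}")
        elif tag == "delete":
            ops.append(f"delete: {hypothesis[i1:i2]}")
        elif tag == "insert":
            ops.append(f"insert: {gold[j1:j2]}")
    return tuple(ops)

# Illustrative (predicted, gold) pairs; count how often each operation set occurs.
errors = [("luge", "lugema"), ("labs", "laba"), ("tule", "tulema")]
op_counts = Counter(edit_ops(h, g) for h, g in errors)
print(op_counts.most_common())
```

Ranking the resulting counts by frequency gives the per-language list of most common error types.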

For the case of Latvian, we find that the operation {replace: s $\rightarrow$ a} is the most common error made by our model. This operation corresponds to a possible issue in the Latvian treebank, where adjectives were marked with gendered lemmas. This issue has now been resolved in the latest version of the treebank. For Estonian, the operation {insert: m, insert: a} is the most common error. The suffix -ma in Estonian is used to indicate the infinitive form of verbs. Gold lemmata for verbs in Estonian are marked in their infinitive forms whereas our system predicts the stems of these verbs instead. These inflected forms are usually ambiguous and we believe that the model doesn’t generalize well to different form-lemma pairs, partly due to fewer training data available for Estonian. This is an example of an error pattern that could be corrected using improved morphological information about the tokens. Finally, in Arabic, we find that the most common error pattern corresponds to a single ambiguous word form, ’an , which can be lemmatized as ’anna (like “that” in English) or ’an (like “to” in English) depending on the usage of the word in context. The word ’anna must be followed by a nominal sentence while ’an is followed by a verb. Hence, models that can incorporate rich contextual information would be able to avoid such errors.

5.3 Why does our model perform better?

Simply presenting improved results does not entirely satiate our curiosity: we would also like to understand why our model performs better. Specifically, we have assumed an additional level of supervision—namely, the annotation of morphological tags. We provide the differences between our method and our retraining of the Lematus system presented in Table 1. In addition to the performance of the systems, we also list the number of tokens in each treebank and the number of distinct morphological tags per language. We perform a correlational study, which is shown in Table 2.

Morphological Complexity and Performance.

We see that there is a moderate positive correlation ( $\rho=0.209$ ) between the number of morphological tags in a language and the improvement our model obtains. As we take the number of tags as a proxy for the morphological complexity in the language, we view this as an indication that attempting to directly extract the relevant morpho-syntactic information from the corpus is not as effective when there is more to learn. In such languages, we recommend exploiting the additional annotation to achieve better results.

Amount of Data and Performance.

The second correlation we find is a stronger negative correlation ($\rho=-0.845$) between the number of tokens available for training in the treebank and the gains in performance of our model over the baseline. This is further demonstrated by the learning curve plot in Figure 6, where we plot the validation accuracy on the Polish treebank for different sizes of the training set. The gap between the performance of our model and Lematus-ch20 is larger when fewer training data are available, especially for ambiguous tokens. This indicates that the incorporation of morphological tags into a model helps more in the low-resource setting. Indeed, this conclusion makes sense: neural networks are good at extracting features from text when there is a sufficiently large amount of data. However, in the low-resource case, we would expect direct supervision on the sort of features we desire to extract to work better. Thus, our second recommendation is to model tags jointly with lemmata when fewer training tokens are available. As we noted earlier, it is almost always the case that token-level annotation of lemmata comes with token-level annotation of morphological tags. In low-resource scenarios, a data augmentation approach such as the one proposed by Bergmanis and Goldwater (2019) can be helpful and complementary to our approach.
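The two correlation coefficients used in this study can be computed as in the sketch below. The input numbers are invented to show the mechanics and are not the per-language values from the paper.

```python
import math

def pearson(x, y):
    """Pearson's R: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's R computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up treebank sizes (thousands of tokens) and accuracy gains.
n_tokens = [10, 20, 40, 80]
delta = [2.0, 1.0, 0.5, 0.1]
print(pearson(n_tokens, delta), spearman(n_tokens, delta))
```

Spearman's $\rho$ depends only on the ordering of the values, which makes it less sensitive than Pearson's $R$ to a few very large treebanks.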

6 Conclusion

We have presented a simple joint model for morphological tagging and lemmatization and discussed techniques for training and decoding. Empirically, we have shown that our model achieves state-of-the-art results, suggesting that explicitly modeling morphological tags is a more effective way to model context. Beyond strong numbers, we have tried to explain when and why our model does better. Specifically, we show that our gains over the baseline correlate with the number of tokens and, more weakly, with the number of tags in a treebank. We take this to indicate that our method improves performance more for low-resource languages as well as morphologically rich languages.

Acknowledgments

We thank Toms Bergmanis for his detailed feedback on the accepted version of the manuscript. Additionally, we would like to thank the three anonymous reviewers for their valuable suggestions. The last author would like to acknowledge support from a Facebook Fellowship.

Appendix A Additional Results

In Table 3 and Table 4, we present the exact numbers on all languages so that future work can compare against our results. We also present morphological tagging results in Table 5.

Bibliography


1. Agić and Schluter (2017). Željko Agić and Natalie Schluter. 2017. How (not) to train a dependency parser: The curious case of jackknifing part-of-speech taggers. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 679–684. Association for Computational Linguistics.
2. Bahdanau et al. (2015). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.
3. Bergmanis and Goldwater (2018). Toms Bergmanis and Sharon Goldwater. 2018. Context sensitive neural lemmatization with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1391–1400. Association for Computational Linguistics.
4. Bergmanis and Goldwater (2019). Toms Bergmanis and Sharon Goldwater. 2019. Data Augmentation for Context-Sensitive Neural Lemmatization Using Inflection Tables and Raw Text. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics.
5. Björkelund et al. (2010). Anders Björkelund, Bernd Bohnet, Love Hafdell, and Pierre Nugues. 2010. A high-performance syntactic and semantic dependency parser. In COLING 2010: Demonstrations, pages 33–36, Beijing, China. Coling 2010 Organizing Committee.
6. Chahuneau et al. (2013). Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer. 2013. Translating into morphologically rich languages with synthetic phrases. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1677–1687, Seattle, Washington, USA. Association for Computational Linguistics.
7. Chrupała (2008). Grzegorz Chrupała. 2008. Towards a machine-learning architecture for lexical functional grammar parsing. Ph.D. thesis, Dublin City University.
8. Chrupala et al. (2008). Grzegorz Chrupala, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco. European Language Resources Association (ELRA).
