Training Data Augmentation for Context-Sensitive Neural Lemmatization Using Inflection Tables and Raw Text
Toms Bergmanis, Sharon Goldwater

TL;DR
This paper introduces a method for training context-sensitive neural lemmatizers in low-resource settings by leveraging inflection tables and raw text, improving generalization especially on unseen words.
Contribution
It presents a novel approach that combines inflection tables with raw text to train lemmatizers without requiring fully annotated sentences.
Findings
Improved lemmatization accuracy on unseen words.
Effective training with minimal labeled data.
Generalization from unambiguous examples.
Abstract
Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form. Using context can help, both for unseen and ambiguous words. Yet most context-sensitive approaches require full lemma-annotated sentences for training, which may be scarce or unavailable in low-resource languages. In addition (as shown here), in a low-resource setting, a lemmatizer can learn more from labeled examples of distinct words (types) than from (contiguous) labeled tokens, since the latter contain far fewer distinct types. To combine the efficiency of type-based learning with the benefits of context, we propose a way to train a context-sensitive lemmatizer with little or no labeled corpus data, using inflection tables from the UniMorph project and raw text examples from Wikipedia that provide sentence contexts for the unambiguous UniMorph examples. …

| | noun: ceļš | | noun: celis | |
|---|---|---|---|---|
| | SG | PL | SG | PL |
| NOM | ceļš | ceļi | celis | ceļi |
| GEN | ceļa | ceļu | ceļa | ceļu |
| DAT | ceļam | ceļiem | celim | ceļiem |
| ACC | ceļu | ceļus | celi | ceļus |
| INS | ceļu | ceļiem | celi | ceļiem |
| LOC | ceļā | ceļos | celī | ceļos |
| VOC | ceļ | ceļi | celi | ceļi |

| Training data | System | Ambig. | Unseen | All |
|---|---|---|---|---|
| Tokens | Baseline | 41.0 | 26.6 | 31.0 |
| | Lemming | 38.2 | 48.3 | 50.6 |
| | HMAM | 41.4 | 50.2 | 52.1 |
| | Lematus 0-ch | 39.9 | 43.7 | 46.8 |
| | Lematus 20-ch | 38.4 | 42.8 | 45.8 |
| Types | Baseline | 45.0 | 26.6 | 32.4 |
| | Lemming | N/A | N/A | N/A |
| | HMAM | 41.8 | 53.7 | 56.3 |
| | Lematus 0-ch | 42.5 | 53.7 | 55.1 |
| | Lematus 20-ch | 43.1 | 51.7 | 54.9 |

| System | Development: Ambig. | Development: Unseen | Development: All | Test: All |
|---|---|---|---|---|
| 1k UDT types (No augmentation) | | | | |
| Baseline | 49.1 | 30.8 | 36.7 | - |
| HMAM | 46.3 | 58.9†‡ | 61.5†‡ | 62.6†‡ |
| Lematus 0-ch | 46.5 | 55.0 | 58.5 | 59.1‡ |
| Lematus 20-ch | 45.0 | 54.3 | 57.7 | 57.7 |
| 1k UDT types + 1k UM types | | | | |
| Baseline | 45.9 | 30.8 | 38.4 | - |
| AE Aug Baseline | 45.6 | 57.5 | 60.4 | 60.8 |
| HMAM | 45.9 | 60.2 | 64.2 | 64.3 |
| Lematus 0-ch | 46.6 | 59.0 | 63.4 | 63.6 |
| Lematus 20-ch | 49.8∗ | 61.7∗† | 65.5∗† | 65.3∗ |
| 1k UDT types + 5k UM types | | | | |
| Baseline | 55.4∗†‡ | 30.7 | 41.7 | - |
| AE Aug Baseline | 46.0 | 58.8 | 61.3 | 61.6 |
| HMAM | 46.7 | 60.8 | 65.7 | 65.7 |
| Lematus 0-ch | 46.2 | 61.5 | 66.1 | 66.4 |
| Lematus 20-ch | 48.6 | 65.4∗† | 69.2∗† | 69.5∗† |
| 1k UDT types + 10k UM types | | | | |
| Baseline | 54.9∗† | 31.2 | 43.5 | - |
| AE Aug Baseline | 46.3 | 58.6 | 61.2 | 61.7 |
| HMAM | 45.4 | 60.8 | 65.5 | 65.3 |
| Lematus 0-ch | 45.5 | 62.1 | 66.4 | 66.4 |
| Lematus 20-ch | 49.5∗ | 66.7∗† | 70.6∗† | 70.9∗† |

| Training data | Ambig. | Unseen | All |
|---|---|---|---|
| Type accuracy: | | | |
| 1k UDT+10k UM | 49.5 | 66.7 | 70.6 |
| 10k UDT tok. | 59.6 | 71.4 | 76.6 |
| 10k UDT tok.+10k UM | 60.8 | 75.1 | 80.1 |
| Token accuracy: | | | |
| 1k UDT+10k UM | 55.5 | 66.5 | 77.0 |
| 10k UDT tok. | 72.4 | 72.5 | 85.3 |
| 10k UDT tok.+10k UM | 72.3 | 75.3 | 87.3 |

| Type accuracy: | | Ambig. | Unseen | All |
|---|---|---|---|---|
| Bulgarian | Baseline | 63.5 | 39.3 | 45.0 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 50.7 | 61.0 | 63.5 |
| | Lematus 0-ch | 45.9 | 51.3 | 55.7 |
| | Lematus 20-ch | 41.6 | 47.2 | 52.1 |
| Czech | Baseline | 38.1 | 31.2 | 33.0 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 45.2 | 66.8 | 66.7 |
| | Lematus 0-ch | 40.7 | 59.9 | 60.1 |
| | Lematus 20-ch | 40.1 | 58.3 | 58.6 |
| Estonian | Baseline | 51.0 | 24.1 | 32.0 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 39.9 | 41.2 | 46.2 |
| | Lematus 0-ch | 38.0 | 42.8 | 47.6 |
| | Lematus 20-ch | 47.8 | 39.9 | 45.9 |
| Finnish | Baseline | 46.4 | 21.3 | 26.1 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 44.7 | 48.0 | 50.4 |
| | Lematus 0-ch | 44.4 | 41.5 | 44.9 |
| | Lematus 20-ch | 44.6 | 43.0 | 46.0 |
| Latvian | Baseline | 42.4 | 25.6 | 31.6 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 44.0 | 52.6 | 55.6 |
| | Lematus 0-ch | 47.1 | 51.8 | 55.2 |
| | Lematus 20-ch | 43.1 | 52.1 | 55.2 |
| Polish | Baseline | 42.9 | 26.6 | 33.3 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 41.2 | 60.5 | 62.4 |
| | Lematus 0-ch | 40.9 | 60.4 | 62.6 |
| | Lematus 20-ch | 35.5 | 59.7 | 62.2 |
| Romanian | Baseline | 27.6 | 34.9 | 40.0 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 38.8 | 55.1 | 57.9 |
| | Lematus 0-ch | 44.6 | 50.2 | 54.5 |
| | Lematus 20-ch | 40.7 | 50.9 | 54.9 |
| Russian | Baseline | 43.0 | 34.9 | 39.0 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 39.3 | 66.4 | 67.0 |
| | Lematus 0-ch | 42.3 | 63.4 | 65.4 |
| | Lematus 20-ch | 44.6 | 63.7 | 65.5 |
| Swedish | Baseline | 77.8 | 42.8 | 52.7 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 58.5 | 67.7 | 72.6 |
| | Lematus 0-ch | 59.5 | 64.1 | 70.1 |
| | Lematus 20-ch | 54.0 | 62.6 | 68.1 |
| Turkish | Baseline | 58.8 | 26.6 | 33.6 |
| | AE Aug Baseline | - | - | - |
| | HMAM | 60.2 | 69.6 | 72.3 |
| | Lematus 0-ch | 61.8 | 64.7 | 68.4 |
| | Lematus 20-ch | 58.2 | 65.4 | 68.6 |

| Type accuracy: | | Ambig. | Unseen | All |
|---|---|---|---|---|
| Bulgarian | Baseline | 67.2% | 39.3% | 50.0% |
| | AE Aug Baseline | 47.9% | 62.6% | 65.0% |
| | HMAM | 44.3% | 68.2% | 72.1% |
| | Lematus 0-ch | 43.1% | 67.0% | 71.1% |
| | Lematus 20-ch | 50.4% | 65.9% | 70.0% |
| Czech | Baseline | 43.0% | 31.2% | 36.8% |
| | AE Aug Baseline | 43.2% | 66.9% | 66.6% |
| | HMAM | 41.0% | 61.9% | 64.7% |
| | Lematus 0-ch | 39.5% | 61.6% | 64.4% |
| | Lematus 20-ch | 42.6% | 68.4% | 69.7% |
| Estonian | Baseline | 62.9% | 24.1% | 37.1% |
| | AE Aug Baseline | 43.1% | 40.3% | 45.3% |
| | HMAM | 48.0% | 44.9% | 53.3% |
| | Lematus 0-ch | 51.3% | 45.2% | 53.5% |
| | Lematus 20-ch | 48.6% | 49.7% | 56.3% |
| Finnish | Baseline | 49.4% | 21.3% | 30.3% |
| | AE Aug Baseline | 42.5% | 44.9% | 47.6% |
| | HMAM | 44.0% | 58.4% | 62.5% |
| | Lematus 0-ch | 45.9% | 60.8% | 64.7% |
| | Lematus 20-ch | 52.5% | 61.9% | 65.5% |
| Latvian | Baseline | 45.6% | 25.6% | 35.9% |
| | AE Aug Baseline | 39.6% | 53.4% | 55.3% |
| | HMAM | 45.2% | 52.3% | 57.6% |
| | Lematus 0-ch | 43.8% | 54.5% | 59.1% |
| | Lematus 20-ch | 44.7% | 57.6% | 61.1% |
| Polish | Baseline | 50.4% | 26.6% | 39.2% |
| | AE Aug Baseline | 38.8% | 64.1% | 66.2% |
| | HMAM | 41.6% | 62.3% | 68.4% |
| | Lematus 0-ch | 43.3% | 65.2% | 70.7% |
| | Lematus 20-ch | 40.3% | 69.7% | 73.4% |
| Romanian | Baseline | 44.3% | 34.9% | 47.9% |
| | AE Aug Baseline | 41.3% | 54.9% | 58.4% |
| | HMAM | 50.2% | 58.4% | 65.6% |
| | Lematus 0-ch | 51.4% | 60.8% | 67.2% |
| | Lematus 20-ch | 47.9% | 62.6% | 67.7% |
| Russian | Baseline | 48.5% | 34.7% | 44.4% |
| | AE Aug Baseline | 42.1% | 65.5% | 66.5% |
| | HMAM | 46.4% | 65.5% | 69.7% |
| | Lematus 0-ch | 40.5% | 64.4% | 68.5% |
| | Lematus 20-ch | 42.7% | 71.1% | 73.8% |
| Swedish | Baseline | 80.6% | 42.8% | 58.0% |
| | AE Aug Baseline | 58.7% | 67.3% | 71.4% |
| | HMAM | 51.9% | 72.6% | 77.7% |
| | Lematus 0-ch | 49.1% | 71.4% | 76.2% |
| | Lematus 20-ch | 49.5% | 72.3% | 77.4% |
| Turkish | Baseline | 61.8% | 26.6% | 37.9% |
| | AE Aug Baseline | 62.8% | 68.5% | 71.2% |
| | HMAM | 54.2% | 63.6% | 65.7% |
| | Lematus 0-ch | 53.6% | 63.9% | 65.5% |
| | Lematus 20-ch | 67.1% | 74.6% | 77.2% |

| | | Type level accuracy: | | | Token level accuracy: | | |
|---|---|---|---|---|---|---|---|
| | Training data | Ambig. | Unseen | All | Ambig. | Unseen | All |
| Bulgarian | 10k UDT tok. | 62.3 | 75.7 | 80.1 | 72.3 | 75.7 | 89.5 |
| | 10k UDT tok. + 10k UM types | 62.2 | 78.7 | 83.6 | 73.3 | 78.1 | 91.0 |
| Czech | 10k UDT tok. | 49.7 | 76.4 | 77.8 | 80.7 | 77.7 | 88.3 |
| | 10k UDT tok. + 10k UM types | 52.4 | 78.3 | 80.4 | 80.0 | 80.0 | 89.6 |
| Estonian | 10k UDT tok. | 65.3 | 54.0 | 64.5 | 80.1 | 54.3 | 76.8 |
| | 10k UDT tok. + 10k UM types | 65.9 | 63.4 | 72.6 | 81.5 | 64.2 | 82.4 |
| Finnish | 10k UDT tok. | 60.7 | 60.1 | 66.5 | 73.8 | 62.4 | 78.2 |
| | 10k UDT tok. + 10k UM types | 57.8 | 63.7 | 69.4 | 70.3 | 66.0 | 79.8 |
| Latvian | 10k UDT tok. | 57.5 | 70.9 | 75.6 | 69.2 | 70.5 | 82.6 |
| | 10k UDT tok. + 10k UM types | 58.9 | 73.6 | 77.8 | 70.2 | 73.8 | 84.4 |
| Polish | 10k UDT tok. | 59.8 | 78.7 | 83.6 | 76.5 | 78.8 | 89.5 |
| | 10k UDT tok. + 10k UM types | 57.4 | 81.2 | 86.1 | 71.3 | 81.4 | 90.9 |
| Romanian | 10k UDT tok. | 51.7 | 61.1 | 66.6 | 54.1 | 60.6 | 79.1 |
| | 10k UDT tok. + 10k UM types | 57.1 | 68.2 | 74.2 | 60.7 | 68.2 | 83.9 |
| Russian | 10k UDT tok. | 64.4 | 80.5 | 83.5 | 65.9 | 80.8 | 88.5 |
| | 10k UDT tok. + 10k UM types | 61.1 | 82.6 | 85.9 | 59.9 | 82.7 | 89.8 |
| Swedish | 10k UDT tok. | 63.2 | 74.9 | 80.9 | 78.5 | 73.6 | 89.6 |
| | 10k UDT tok. + 10k UM types | 65.1 | 78.4 | 83.7 | 79.0 | 75.9 | 90.4 |
| Turkish | 10k UDT tok. | 64.2 | 82.1 | 87.1 | 73.1 | 81.8 | 91.2 |
| | 10k UDT tok. + 10k UM types | 69.9 | 82.9 | 87.3 | 76.9 | 82.7 | 91.5 |

Training Data Augmentation for Context-Sensitive Neural Lemmatization
Using Inflection Tables and Raw Text
Toms Bergmanis
School of Informatics
University of Edinburgh
Sharon Goldwater
School of Informatics
University of Edinburgh
Abstract
Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form. Using context can help, both for unseen and ambiguous words. Yet most context-sensitive approaches require full lemma-annotated sentences for training, which may be scarce or unavailable in low-resource languages. In addition (as shown here), in a low-resource setting, a lemmatizer can learn more from labeled examples of distinct words (types) than from (contiguous) labeled tokens, since the latter contain far fewer distinct types. To combine the efficiency of type-based learning with the benefits of context, we propose a way to train a context-sensitive lemmatizer with little or no labeled corpus data, using inflection tables from the UniMorph project and raw text examples from Wikipedia that provide sentence contexts for the unambiguous UniMorph examples. Despite these being unambiguous examples, the model successfully generalizes from them, leading to improved results (both overall, and especially on unseen words) in comparison to a baseline that does not use context.
1 Introduction
Many lemmatizers work on isolated wordforms (Wicentowski, 2002; Dreyer et al., 2008; Rastogi et al., 2016; Makarov and Clematide, 2018b, a). Lemmatizing in context can improve accuracy on ambiguous and unseen words (Bergmanis and Goldwater, 2018), but most systems for context-sensitive lemmatization must train on complete sentences labeled with POS and/or morphological tags as well as lemmas, and have only been tested with 20k-300k training tokens (Chrupała et al., 2008; Müller et al., 2015; Chakrabarty et al., 2017). (Footnote 1: The smallest of these corpora contains 20k tokens of Bengali annotated only with lemmas, which Chakrabarty et al. (2017) reported took around two person months to create.)
Intuitively, though, sentence-annotated data is inefficient for training a lemmatizer, especially in low-resource settings. Training on (say) 1000 word types will provide far more information about a language's morphology than training on 1000 contiguous tokens, where fewer types are represented. As noted above, sentence data can help with ambiguous and unseen words, but we show here that when data is scarce, this effect is small relative to the benefit of seeing more word types. (Footnote 2: Garrette et al. (2013) found the same for POS tagging.) Motivated by this result, we propose a training data augmentation method that combines the efficiency of type-based learning and the expressive power of a context-sensitive model. (Footnote 3: Code and data: https://bitbucket.org/tomsbergmanis/data_augumentation_um_wiki) We use Lematus (Bergmanis and Goldwater, 2018), a state-of-the-art lemmatizer that learns from lemma-annotated words in their N-character contexts. No predictions about surrounding words are used, so fully annotated training sentences are not needed. We exploit this fact by combining two sources of training data: 1k lemma-annotated types (with contexts) from the Universal Dependency Treebank (UDT) v2.2 (Nivre et al., 2017; footnote 4: http://hdl.handle.net/11234/1-2837), plus examples obtained by finding unambiguous word-lemma pairs in inflection tables from the Universal Morphology (UM) project (footnote 5: http://unimorph.org) and collecting sentence contexts for them from Wikipedia. Although these examples are noisy and biased, we show that they improve lemmatization accuracy in experiments on 10 languages, and that the use of context helps, both overall and especially on unseen words.
2 Method
Lematus (Bergmanis and Goldwater, 2018) is a neural sequence-to-sequence model with attention inspired by the re-inflection model of Kann and Schütze (2016), which won the 2016 SIGMORPHON shared task (Cotterell et al., 2016). It is built using the Nematus machine translation toolkit (footnote 6: code for Nematus: https://github.com/EdinburghNLP/nematus, code for Lematus: https://bitbucket.org/tomsbergmanis/lematus.git), which uses the architecture of Sennrich et al. (2017): a 2-layer bidirectional GRU encoder and a 2-layer decoder with a conditional GRU in the first layer and a GRU in the second layer.
Lematus takes as input a character sequence representing the wordform in its N-character context, and outputs the characters of the lemma. Special input symbols are used to represent the left and right boundary of the target wordform (<lc>, <rc>) and other word boundaries (<s>). For example, if N = 15, the system trained on Latvian would be expected to produce the characters of the lemma ceļš (meaning road) given input such as:
s a k a <s> p a š v a l d ī b u
<lc> c e ļ u <rc>
u n <s> i e l u <s> r e ģ i s t r
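The input format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Lematus preprocessing code: the function name `format_example` is hypothetical, the sentence is assumed to be pre-tokenized, the boundary symbol `<s>` is assumed to count as one symbol of context, and the space adjacent to the target word is assumed to be dropped (as it appears to be in the example).

```python
def format_example(tokens, index, n):
    """Build a Lematus-style input symbol sequence for tokens[index].

    tokens: list of wordforms in one sentence (assumed pre-tokenized)
    index:  position of the target wordform
    n:      number of context symbols kept on each side (n = 0 means no context)
    """
    def to_symbols(words):
        # Characters of each word, with <s> marking boundaries between words.
        symbols = []
        for i, word in enumerate(words):
            if i > 0:
                symbols.append("<s>")
            symbols.extend(word)
        return symbols

    left = to_symbols(tokens[:index])
    right = to_symbols(tokens[index + 1:])
    # Guard n == 0 explicitly: left[-0:] would return the whole list.
    left = left[-n:] if n > 0 else []
    right = right[:n]
    return left + ["<lc>"] + list(tokens[index]) + ["<rc>"] + right


# Sentence fragment taken from the example; the last token is a truncated word.
fragment = ["saka", "pašvaldību", "ceļu", "un", "ielu", "reģistr"]
print(" ".join(format_example(fragment, 2, 15)))
print(" ".join(format_example(fragment, 2, 0)))
```

With `n = 0` the output contains only the target wordform between `<lc>` and `<rc>`, which corresponds to the context-free setting.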
When N = 0 (Lematus 0-ch), no context is used, making Lematus 0-ch comparable to other systems that do not model context (Dreyer et al., 2008; Rastogi et al., 2016; Makarov and Clematide, 2018b, a). In our experiments we use both Lematus 0-ch and Lematus 20-ch (20 characters of context), which was the best-performing system reported by Bergmanis and Goldwater (2018).
2.1 Data Augmentation
Our data augmentation method uses UM inflection tables and creates additional training examples by finding Wikipedia sentences that use the inflected wordforms in context, pairing them with their lemma as shown in the inflection table. However, we cannot use all the words in the tables because some of them are ambiguous: for example, Figure 1 shows that the form ceļi could be lemmatized either as ceļš or celis. Since we don't know which would be correct for any particular Wikipedia example, we only collect examples for forms which are unambiguous according to the UM tables. However, this method is only as good as the coverage of the UM tables. For example, if UM doesn't include a table for the Latvian verb celt, then the underlined forms in Table 1 would be incorrectly labeled as unambiguous.
There are several other issues with this method that could potentially limit its usefulness. First, the UM tables only include verbs, nouns and adjectives, whereas we test the system on UDT data, which includes all parts of speech. Second, by excluding ambiguous forms, we may be restricting the added examples to a non-representative subset of the potential inflections, or the system may simply ignore the context because it isn't needed for these examples. Finally, there are some annotation differences between UM and UDT. (Footnote 7: Recent efforts to unify the two resources have mostly focused on validating dataset schema (McCarthy et al., 2018), leaving conflicts in word lemmas unresolved. We estimated, by counting types that are unambiguous in each dataset but have different lemmas across them, that annotation inconsistencies affect up to 1% of types in the languages we used.) Despite all of these issues, however, we show below that the added examples and their contexts do actually help.
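As a rough illustration of the selection step described in this section, the sketch below keeps only forms that map to a single lemma and then takes the first sentence context found for each. It is a minimal sketch, not the released code: the function names, the `(lemma, form)` input format, and the toy sentences are all assumptions made for the example, and only a subset of the forms of ceļš and celis is used.

```python
from collections import defaultdict


def unambiguous_forms(um_entries):
    """Return {form: lemma} for forms listed under exactly one lemma.

    um_entries: iterable of (lemma, inflected_form) pairs, assumed to have
    been read from inflection tables beforehand.
    """
    lemmas_by_form = defaultdict(set)
    for lemma, form in um_entries:
        lemmas_by_form[form].add(lemma)
    return {
        form: next(iter(lemmas))
        for form, lemmas in lemmas_by_form.items()
        if len(lemmas) == 1
    }


def collect_examples(sentences, form_to_lemma, max_types):
    """Take the first context found for each unambiguous form, up to max_types."""
    examples = {}
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if len(examples) >= max_types:
                return examples
            if token in form_to_lemma and token not in examples:
                examples[token] = (tokens, i, form_to_lemma[token])
    return examples


# A subset of the two noun paradigms; "ceļi" and "ceļa" are shared by both lemmas.
entries = [
    ("ceļš", "ceļš"), ("ceļš", "ceļi"), ("ceļš", "ceļa"), ("ceļš", "ceļam"),
    ("celis", "celis"), ("celis", "ceļi"), ("celis", "ceļa"), ("celis", "celim"),
]
form_to_lemma = unambiguous_forms(entries)

# Placeholder sentences (not real Latvian text), standing in for shuffled Wikipedia.
sentences = [["x", "ceļi", "y"], ["x", "ceļam", "y"], ["celim", "z"]]
print(collect_examples(sentences, form_to_lemma, max_types=10))
```

Note that this filter inherits the coverage problem discussed above: a form is treated as unambiguous whenever no second lemma for it happens to be present in the tables.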
3 Experimental Setup
Baselines and Training Parameters
We use four baselines: (1) Lemming (Müller et al., 2015; footnote 8: http://cistern.cis.lmu.de/lemming) is a context-sensitive system that uses log-linear models to jointly tag and lemmatize the data, and is trained on sentences annotated with both lemmas and POS tags. (2) The hard monotonic attention model (HMAM) (Makarov and Clematide, 2018b; footnote 9: https://github.com/ZurichNLP/coling2018-neural-transition-based-morphology) is a neural sequence-to-sequence model with a hard attention mechanism that advances through the sequence monotonically. It is trained on word-lemma pairs (without context) with character-level alignments learned in a preprocessing step using an alignment model, and it has proved to be competitive in low-resource scenarios. (3) Our naive Baseline outputs the most frequent lemma (or one lemma at random from the options that are equally frequent) for words observed in training. For unseen words it outputs the wordform itself. (4) We also try a baseline data augmentation approach (AE Aug Baseline) inspired by Bergmanis et al. (2017) and Kann and Schütze (2017), who showed that adding training examples where the network simply learns to auto-encode corpus words can improve morphological inflection results in low-resource settings. The AE Aug Baseline is a variant of Lematus 0-ch which augments the UDT lemmatization examples by auto-encoding the inflected forms of the UM examples (i.e., it just treats them as corpus words). Comparing AE Aug Baseline to Lematus 0-ch augmented with UM lemma-inflection examples tells us whether using the UM lemma information helps more than simply auto-encoding more inflected examples.
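The naive Baseline (3) is simple enough to sketch directly. The version below is a minimal illustration under our own naming (`train_baseline` and `predict` are hypothetical names, and the English word pairs are made-up data), not the code used in the experiments.

```python
import random
from collections import Counter, defaultdict


def train_baseline(pairs):
    """Count how often each lemma was observed for each wordform."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[form][lemma] += 1
    return counts


def predict(counts, form, rng=random):
    """Most frequent lemma for seen forms (ties broken at random); identity otherwise."""
    if form not in counts:
        return form  # unseen word: output the wordform itself
    best = max(counts[form].values())
    candidates = sorted(l for l, c in counts[form].items() if c == best)
    return rng.choice(candidates)


# Made-up training pairs for illustration only.
model = train_baseline([("ran", "run"), ("ran", "run"), ("ran", "ran"), ("dogs", "dog")])
print(predict(model, "ran"), predict(model, "dogs"), predict(model, "cats"))
```

Because unseen words are copied unchanged, this baseline can only be correct on unseen forms whose lemma is identical to the wordform, which is consistent with its low scores in the Unseen columns.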
To train the models we use the default settings for Lemming and the suggested lemmatization parameters for HMAM. We mainly follow the hyperparameters used by Bergmanis and Goldwater (2018) for Lematus; details are in Appendix A.
Languages and Training Data
We conduct preliminary experiments on five development languages: Estonian, Finnish, Latvian, Polish, and Russian. In our final experiments we also add Bulgarian, Czech, Romanian, Swedish and Turkish. We vary the amount and type of training data (types vs. tokens, UDT only, UM only, or UDT plus up to 10k UM examples), as described in Section 4.
To obtain a given number of UM-based training examples, we select that many of the first unambiguous UM types (with their sentence contexts) from shuffled Wikipedia sentences. For experiments with more than one example per type, we first find all UM types with at least the required number of sentence contexts in Wikipedia and then choose the distinct types and their contexts uniformly at random.
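The multi-context sampling step can be sketched as follows. This is one possible reading of the procedure, with hypothetical names (`sample_types_with_contexts`, `num_types`, `per_type`); in particular, drawing the contexts of each chosen type at random is an assumption, and the toy sentences are invented.

```python
import random
from collections import defaultdict


def sample_types_with_contexts(sentences, form_to_lemma, num_types, per_type, seed=0):
    """Sample types that have at least `per_type` contexts, then sample their contexts.

    sentences:     list of tokenized sentences
    form_to_lemma: {form: lemma} for unambiguous forms
    Returns {form: [(tokens, index), ...]} with `per_type` contexts per form.
    """
    contexts = defaultdict(list)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token in form_to_lemma:
                contexts[token].append((tokens, i))

    # Only types with enough contexts are eligible.
    eligible = sorted(form for form, ctx in contexts.items() if len(ctx) >= per_type)
    rng = random.Random(seed)
    chosen = rng.sample(eligible, min(num_types, len(eligible)))
    return {form: rng.sample(contexts[form], per_type) for form in chosen}


toy_sentences = [["a", "b"], ["a", "c"], ["b"]]
toy_lemmas = {"a": "A", "b": "B", "c": "C"}
print(sample_types_with_contexts(toy_sentences, toy_lemmas, num_types=5, per_type=2))
```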
Evaluation
To evaluate models' ability to lemmatize wordforms in their sentence context we follow Bergmanis and Goldwater (2018) and use the full UDT development and test sets. Unlike Bergmanis and Goldwater (2018), who reported token-level lemmatization exact match accuracy, we report type-level macro-averaged lemmatization exact match accuracy. This measure better reflects improvements on unseen words, which tend to be rare but are more important (since a most-frequent-lemma baseline does very well on seen words, as shown by Bergmanis and Goldwater (2018)).
We separately report performance on unseen and ambiguous tokens. For a fair comparison across scenarios with different training sets, we count as unseen only words that are not ambiguous and are absent from all training sets/scenarios introduced in Section 4. Due to the small training sets, between 70% and 90% of dev set types are classed as unseen in each language. We define a type as ambiguous if the empirical entropy over its lemmas is greater than 0.1 in the full original UDT training splits. (Footnote 10: This measure, adjusted ambiguity, was defined by Kirefu (2018), who noticed that many frequent wordforms appear to have multiple lemmas due to annotation errors. The adjusted ambiguity filters out these cases.) According to this measure, only 1.2-5.3% of dev set types are classed as ambiguous in each language.
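To make the difference between the two measures concrete, here is a small sketch of one plausible reading of type-level macro-averaged accuracy: exact match accuracy is first computed per wordform type and then averaged over types. The function names and the toy data are ours, and the exact averaging used in the experiments may differ in detail.

```python
from collections import defaultdict


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted lemma exactly matches the gold lemma."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def type_macro_accuracy(forms, gold, pred):
    """Average, over distinct wordforms, of the per-wordform exact match accuracy."""
    hits = defaultdict(list)
    for form, g, p in zip(forms, gold, pred):
        hits[form].append(g == p)
    per_type = [sum(h) / len(h) for h in hits.values()]
    return sum(per_type) / len(per_type)


# Toy data: a frequent word lemmatized correctly, a rare word lemmatized wrongly.
forms = ["the", "the", "the", "roads"]
gold = ["the", "the", "the", "road"]
pred = ["the", "the", "the", "roads"]
print(token_accuracy(gold, pred))              # 0.75
print(type_macro_accuracy(forms, gold, pred))  # 0.5
```

In the toy example the frequent word dominates the token-level score, while the type-level score weights the rare word equally, which is why the latter is more sensitive to errors on rare and unseen words.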
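The ambiguity criterion can be illustrated with a short sketch. The threshold of 0.1 comes from the text; the logarithm base is not stated there, so base 2 is an assumption here, as are the function names and the toy counts.

```python
import math
from collections import Counter


def lemma_entropy(lemmas):
    """Empirical entropy (in bits, assumed base 2) of the lemmas observed for one wordform."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_ambiguous(lemmas, threshold=0.1):
    """A type is ambiguous if the entropy over its lemmas exceeds the threshold."""
    return lemma_entropy(lemmas) > threshold


# One stray annotation among 100 tokens stays below the threshold...
print(is_ambiguous(["a"] * 99 + ["b"]))
# ...whereas a 9-to-1 split is counted as genuinely ambiguous.
print(is_ambiguous(["a"] * 9 + ["b"]))
```

Under these assumptions, a wordform with a single deviating lemma in a hundred occurrences has entropy of roughly 0.08 bits and is not counted as ambiguous, which matches the stated purpose of filtering out apparent ambiguity caused by annotation errors.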
Significance Testing
All systems are trained and tested on ten languages. To test for statistically significant differences between the results of two systems we use a Monte Carlo method: for each set of results (i.e. a set of 10 numerical values) we generate 10000 random samples, where each sample swaps the results of the two systems for each language with a probability of 0.5. We then obtain a p-value as the proportion of samples for which the difference on average was at least as large as the difference observed in our experiments.
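The procedure can be sketched as a paired randomization test. This is a minimal sketch rather than the script used for the paper: the function name is hypothetical, the comparison of absolute differences (a two-sided test) is our assumption, and the score lists are invented numbers.

```python
import random


def randomization_test(scores_a, scores_b, samples=10000, seed=0):
    """Monte Carlo paired randomization test on per-language scores.

    Each sample swaps the two systems' scores for each language with
    probability 0.5; the p-value is the proportion of samples whose mean
    difference is at least as large (in absolute value) as the observed one.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b))) / n
    at_least_as_large = 0
    for _ in range(samples):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap the two systems for this language
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_large += 1
    return at_least_as_large / samples


# Invented per-language accuracies for two systems over ten languages.
system_a = [70.1, 69.5, 56.0, 65.2, 61.3, 73.0, 67.9, 73.5, 77.0, 77.4]
system_b = [71.0, 64.2, 53.1, 64.9, 59.0, 70.5, 67.0, 68.8, 76.1, 65.9]
print(randomization_test(system_a, system_b))
```

With only ten paired values there are 2^10 = 1024 distinct swap patterns, so sampling 10000 times gives a close approximation to the exact permutation p-value.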
4 Experiments, Results, and Discussion
Types vs. Tokens and Context in Very Low Resource Settings
We compare training on the first 1k tokens vs. first 1k distinct types of the UDT training sets. Table 2 shows that if only 1k examples are available, using types is clearly better for all systems. Although Lematus does relatively poorly on the token data, it benefits the most from switching to types, putting it on par with HMAM and suggesting it is likely to benefit more from additional type data. Lemming requires token-based data, but does worse than HMAM (a context-free method) in the token-based setting, and we also see no benefit from context in comparing Lematus 20-ch vs Lematus 0-ch. So overall, in this very low-resource scenario with no data augmentation, context does not appear to help.
Using UM + Wikipedia Only
We now try training only on UM + Wikipedia examples, rather than examples from UDT. We use 1k, 2k or 5k unambiguous types from UM with a single example context from Wikipedia for each. With 5k types we also try adding more example contexts (2, 3, or 5 examples for each type).
Figure 1 presents the results (for unseen words only). As with the UDT experiments, there is little difference between Lematus 20-ch and Lematus 0-ch in the smallest data setting. However, when the number of training types increases to 5k, the benefits of context begin to show, with Lematus 20-ch yielding a statistically significant 1.6% improvement over Lematus 0-ch. The results for increasing the number of examples per type are numerically higher than the one-example case, but the differences are not statistically significant.
It is worth noting that the accuracy even with 5k UM types is considerably lower than the accuracy of the model trained on only 1k UDT types (see Table 2). We believe this discrepancy is due to the issues of biased/incomplete data noted above. For example, we analyzed the Latvian data and found that the available tables for nouns, verbs, and adjectives give rise to 78 paradigm slots. The 17 POS tags in UDT give rise to about 10 times as many paradigm slots, although only 448 are present in the unseen words of the dev set. Of these, 197 are represented amongst the 1k UDT training types, whereas only 25 are included in the 1k UM training types. As a result, a much larger share of the unseen dev set types have no representative of their paradigm slot in the 1k UM types than in the 1k UDT types.
Data Augmentation
Although UM + Wikipedia examples alone are not sufficient to train a good lemmatizer, they might improve a low-resource baseline trained on UDT data. To test this, we augmented the 1k UDT types with 1k, 5k or 10k UM types with contexts from Wikipedia.
Table 3 summarizes the results, showing that despite the lower quality of the UM + Wikipedia examples, using them improves results of all systems, and more so with more examples. Improvements are especially strong for unseen types, which constitute more than 70% of types in the dev set. Furthermore, the benefit of the additional UM examples is above and beyond the effect of auto-encoding (AE Aug Baseline) for all systems in all data scenarios.
Considering the two context-free models, HMAM does better on the un-augmented 1k UDT data, but (as predicted by our results above) it benefits less from data augmentation than does Lematus 0-ch, so with added data they are statistically equivalent (on the test set with 10k UM).
More importantly, Lematus 20-ch begins to outperform the context-free models with as few as 1k UM + Wikipedia examples, and the difference increases with more examples, eventually reaching over 4% better on the test set than the next best model (Lematus 0-ch) when 10k UM + Wikipedia examples are used. This indicates that the system can learn useful contextual cues even from unambiguous training examples.
Finally, Figure 2 gives a breakdown of Lematus 20-ch dev set accuracy for individual languages, showing that data augmentation helps consistently, although results suggest diminishing returns.
Data Augmentation in Medium Resource Setting
To examine the extent to which augmented data can help in the medium resource setting of 10k continuous tokens of UDT used in previous work, we follow Bergmanis and Goldwater (2018) and train Lematus 20-ch models for all ten languages using the first 10k tokens of UDT and compare them with models trained on 10k tokens of UDT augmented with 10k UM types. To provide a better comparison of our results, we report both the type and the token level development set accuracy. First of all, Table 4 shows that training on 10k continuous tokens of UDT yields a token level accuracy that is about 8% higher than when using the 1k types of UDT augmented with 10k UM types, which is the best-performing data augmentation system (see Table 3). Again, we believe this performance gap is due to the issues with the biased/incomplete data noted above. For example, we analyzed errors that were unique to the model trained on the Latvian augmented data and found that 41% of the errors were due to wrongly lemmatized words other than nouns, verbs, and adjectives, the three POSs with available inflection tables in UM. For instance, improperly lemmatized pronouns amounted to 14% of the errors on the Latvian dev set. Table 4 also shows that UM examples with Wikipedia contexts benefit lemmatization not only in the low but also the medium resource setting, yielding statistically significant type and token level accuracy gains over models trained on 10k UDT continuous tokens alone (for both Unseen and All).
5 Conclusion
We proposed a training data augmentation method that combines the efficiency of type-based learning and the expressive power of a context-sensitive lemmatization model. The proposed method uses Wikipedia sentences to provide contextualized examples for unambiguous inflection-lemma pairs from UniMorph tables. These examples are noisy and biased, but nevertheless they improve lemmatization accuracy on all ten languages both in low (1k) and medium (10k) resource settings. In particular, we showed that context is helpful, both overall and especially on unseen words. This is the first work we know of to demonstrate improvements from context in a very low-resource setting.
Appendix A Lematus Training
Lematus is implemented using the Nematus machine translation toolkit (footnote 11: https://github.com/EdinburghNLP/nematus). We use default training parameters of Lematus as specified by Bergmanis and Goldwater (2018) except for early stopping with patience (Prechelt, 1998), which we increase to 20. Similar to Bergmanis and Goldwater (2018), we use an initial number of epochs as a burn-in period, after which we validate the current model by its lemmatization exact match accuracy on the first 3k instances of the development set and save this model if it performs better than the previous best model. We choose a burn-in period of 20 and validation interval of 5 epochs for models that we train on datasets up to 2k instances and a burn-in period of 10 and validation interval of 2 epochs for others. As we work with considerably smaller datasets than Bergmanis and Goldwater (2018), we reduce the effective model size and increase the rate of convergence by tying the input embeddings of the encoder, the decoder and the softmax output embeddings (Press and Wolf, 2017).
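The interaction of burn-in period, validation interval and patience can be illustrated with a toy schedule. This is not Nematus code: the function name and the `accuracy_at_epoch` callback are hypothetical, and two details are assumptions, namely that the first validation happens one interval after the burn-in ends and that patience counts consecutive validations without improvement.

```python
def run_schedule(accuracy_at_epoch, max_epochs, burn_in, interval, patience):
    """Simulate validation-based model selection with early stopping.

    accuracy_at_epoch: callable returning dev accuracy of the model after an epoch
    Returns (best_epoch, best_accuracy); best_epoch is None if never validated.
    """
    best_acc, best_epoch, bad_validations = -1.0, None, 0
    for epoch in range(1, max_epochs + 1):
        # Skip the burn-in period and epochs that fall between validation points.
        if epoch <= burn_in or (epoch - burn_in) % interval != 0:
            continue
        acc = accuracy_at_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch, bad_validations = acc, epoch, 0  # "save" this model
        else:
            bad_validations += 1
            if bad_validations >= patience:
                break  # early stopping
    return best_epoch, best_acc


# Made-up dev accuracies at the validated epochs (burn-in 20, interval 5).
curve = {25: 0.50, 30: 0.60, 35: 0.55, 40: 0.58}
print(run_schedule(lambda e: curve.get(e, 0.0), max_epochs=100,
                   burn_in=20, interval=5, patience=2))
```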
Appendix B Data Preparation
Wikipedia database dumps contain XML structured articles that are formatted using the wikitext markup language. To obtain wordforms in their sentence context we 1) use WikiExtractor (footnote 12: https://github.com/attardi/wikiextractor) to extract plain text from Wikipedia database dumps, followed by scripts from the Moses statistical machine translation system (Koehn et al., 2007; footnote 13: https://github.com/moses-smt/mosesdecoder) to 2) split text into sentences (split-sentences.perl), and 3) extract separate tokens (tokenizer.perl). Finally, we shuffle the extracted sentences to encourage homogeneous type distribution across the entire text.
- Nivre et al. (2017): Joakim Nivre et al. 2017. Universal dependencies 2.0 – CoNLL 2017 shared task development and test data. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.
- Bergmanis and Goldwater (2018): Toms Bergmanis and Sharon Goldwater. 2018. Context Sensitive Neural Lemmatization with Lematus. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Bergmanis et al. (2017): Toms Bergmanis, Katharina Kann, Hinrich Schütze, and Sharon Goldwater. 2017. Training Data Augmentation for Low-Resource Morphological Inflection. In Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Vancouver, Canada. Association for Computational Linguistics.
- Chakrabarty et al. (2017): Abhisek Chakrabarty, Onkar Arun Pandit, and Utpal Garain. 2017. Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1481–1491, Vancouver, Canada. Association for Computational Linguistics.
- Chrupała et al. (2008): Grzegorz Chrupała, Georgiana Dinu, and Josef van Genabith. 2008. Learning Morphology with Morfette. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
- Cotterell et al. (2016): Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task: morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.
- Dreyer et al. (2008): Markus Dreyer, Jason R Smith, and Jason Eisner. 2008. Latent-Variable Modeling of String Transductions with Finite-State Methods. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1080–1089. Association for Computational Linguistics.
- Garrette et al. (2013): Dan Garrette, Jason Mielens, and Jason Baldridge. 2013. Real-world semi-supervised learning of POS-taggers for low-resource languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 583–592.
