CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages

Ryan Cotterell; Christo Kirov; John Sylak-Glassman; Géraldine Walther; Ekaterina Vylomova; Patrick Xia; Manaal Faruqui; Sandra Kübler; David Yarowsky; Jason Eisner; Mans Hulden

arXiv:1706.09031·cs.CL·July 6, 2017

TL;DR

The paper reports on the 2017 shared task on morphological reinflection across 52 languages, demonstrating that neural models can perform well even with limited data, but different approaches predict different forms, indicating room for improvement.

Contribution

It introduces a large-scale shared task for morphological reinflection in diverse languages and evaluates neural models' effectiveness under various resource conditions.

Findings

1. Neural models perform well with small datasets using appropriate biases.

2. Different data augmentation methods lead to different correct predictions.

3. High performance achievable with limited labeled data.

Tables

Table 8: Sub-task 2 results: per-form accuracy (in percentage points) and average Levenshtein distance from the correct form (in characters). Each cell is accuracy/distance.

System	High	Medium	Low
LMU-2	88.52/0.22	82.02/0.38	67.76/0.75
LMU-1	87.40/0.24	77.02/0.47	54.74/1.22
CU-1	67.77/0.75	60.94/1.03	47.89/1.67
baseline	76.87/0.51	65.84/0.83	50.14/1.28
oracle-e	94.11/*	88.70/*	75.84/*

Full text

CoNLL-SIGMORPHON 2017 Shared Task:

Universal Morphological Reinflection in 52 Languages

Ryan Cotterell (1), Christo Kirov (1), John Sylak-Glassman (1), Géraldine Walther (2), Ekaterina Vylomova (3), Patrick Xia (1), Manaal Faruqui (4), Sandra Kübler (5), David Yarowsky (1), Jason Eisner (1), Mans Hulden (6)

(1) Johns Hopkins University, (2) University of Zurich, (3) University of Melbourne, (4) Google, (5) Indiana University, (6) University of Colorado

Abstract

The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by predicting all of the remaining inflected forms. Both sub-tasks included high, medium, and low-resource conditions. Sub-task 1 received 24 system submissions, while sub-task 2 received 3 system submissions. Following the success of neural sequence-to-sequence models in the SIGMORPHON 2016 shared task, all but one of the submissions included a neural component. The results show that high performance can be achieved with small training datasets, so long as models have appropriate inductive bias or make use of additional unlabeled data or synthetic data. However, different biasing and data augmentation resulted in non-identical sets of inflected forms being predicted correctly, suggesting that there is room for future improvement.

1 Introduction

Morphology interacts with both syntax and phonology. As a result, explicitly modeling morphology has been shown to aid a number of tasks in human language technology (HLT), including machine translation (MT) Dyer et al. (2008), speech recognition Creutz et al. (2007), parsing Seeker and Çetinoǧlu (2015), keyword spotting Narasimhan et al. (2014), and word embedding Cotterell et al. (2016b). Dedicated systems for modeling morphological patterns and complex word forms have received less attention from the HLT community than tasks that target other levels of linguistic structure. Recently, however, there has been a surge of work in this area Durrett and DeNero (2013); Ahlberg et al. (2014); Nicolai et al. (2015); Faruqui et al. (2016), representing a renewed interest in morphology and the potential to use advances in machine learning to attack a fundamental problem in string-to-string transformations: the prediction of one morphologically complex word form from another. This increased interest in morphology as an independent set of problems within HLT arrives at a particularly opportune time, as morphology is also undergoing a methodological renewal within theoretical linguistics where it is moving towards increased interdisciplinary work and quantitative methodologies Moscoso del Prado Martín et al. (2004); Milin et al. (2009); Ackerman et al. (2009); Sagot and Walther (2011); Ackerman and Malouf (2013); Baayen et al. (2013); Blevins (2013); Pirrelli et al. (2015); Blevins (2016). Pushing the HLT research agenda forward in the domain of morphology promises to lead to mutually highly beneficial dialogue between the two fields.

Rich morphology is the norm among the languages of the world. The linguistic typology database WALS shows that 80% of the world's languages mark verb tense through morphology while 65% mark grammatical case Haspelmath et al. (2005). The more limited inflectional system of English may help to explain the fact that morphology has received less attention in the computational literature than it is arguably due.

The CoNLL-SIGMORPHON 2017 shared task worked to promote the development of robust systems that can learn to perform cross-linguistically reliable morphological inflection and morphological paradigm cell filling using varying amounts of training data. We note that this is also the first CoNLL-hosted shared task to focus on morphology. The task itself featured training and development data from 52 languages representing a range of language families. Many of the languages included were extremely low-resource, e.g., Quechua, Navajo, and Haida. The chosen languages also encompassed diverse morphological properties and inflection processes. Whenever possible, three data conditions were given for each language: low, medium, and high. In the inflection sub-task, these corresponded to seeing 100 examples, 1,000 examples, and 10,000 examples respectively in the training data for almost all languages. The results show that encoder-decoder recurrent neural network models (RNNs) can perform very well even with small training sets, if they are augmented with various mechanisms to cope with the low-resource setting. The shared task training, development, and test data are released publicly at https://github.com/sigmorphon/conll2017.

2 Task and Evaluation Details

This year's shared task contained two sub-tasks, which represented slightly different learning scenarios that might be faced by an HLT engineer or (roughly speaking) a human learner. Beyond manually vetted data for training, development and test, monolingual corpus data (Wikipedia dumps) was also provided for both of the sub-tasks. [Footnote 2, on the manually vetted data: Thanks to: Iñaki Alegria, Gerlof Bouma, Zygmunt Frajzyngier, Chris Harvey, Ghazaleh Kazeminejad, Jordan Lachler, Luciana Marques, and Ruben Urizar.] Figure 1 illustrates the two tasks and defines some terminology.

The CoNLL-SIGMORPHON 2017 shared task is the second shared task in a series that began with the SIGMORPHON 2016 shared task on morphological reinflection Cotterell et al. (2016a). In contrast to 2016, it happens that both of the 2017 sub-tasks actually involve only inflection, not reinflection. [Footnote 3: Cotterell et al. (2016a) defined the term: "Systems developed for the 2016 Shared Task had to carry out reinflection of an already inflected form. This involved analysis of an already inflected word form, together with synthesis of a different inflection of that form." In 2016, sub-task 1 involved only inflection while sub-tasks 2–3 required reinflection.] Nonetheless, we kept "reinflection" in this year's title to make it easier to refer to the series of tasks.

2.1 Sub-Task 1: Inflected Form from Lemma

The first sub-task in Figure 1 required morphological generation with sparse training data, something that can be practically useful for MT and other downstream tasks in NLP. Here, participants were given examples of inflected forms as shown in Table 1. Each test example asked them to produce some other inflected form when given a lemma and a bundle of morphosyntactic features.

The training data was sparse in the sense that it included only a few inflected forms from each lemma. That is, as in human L1 learning, the learner does not necessarily observe any complete paradigms in a language where the paradigms are large (e.g., dozens of inflected forms per lemma). [Footnote 4: Of course, human L1 learners do not get to observe explicit morphological feature bundles for the types that they observe. Rather, they analyze inflected tokens in context to discover both morphological features (including inherent features such as noun gender Arnon and Ramscar (2012)) and paradigmatic structure (number of forms per lemma, number of expressed featural contrasts such as tense, number, person…).]

Key points:

1. Our sub-task 1 is similar to sub-task 1 of the SIGMORPHON 2016 shared task Cotterell et al. (2016a), but with structured inflectional tags Sylak-Glassman et al. (2015a), learning curve assessment, and many new typologically diverse languages, including low-resource languages.

2. The task is inflection: Given an input lemma and desired output tag, participants had to generate the correct output inflected form (a string).

3. The supervised training data consisted of individual forms (Table 1) that were sparsely sampled from a large number of paradigms.

4. Forms that are empirically more frequent were more likely to appear in both training and test data (see § 3 for details).

5. Unannotated corpus data was also provided to participants.

6. Systems were evaluated after training on $10^{2}$, $10^{3}$, and $10^{4}$ forms.

2.2 Sub-Task 2: Paradigm Completion

The second sub-task in Figure 1 focused on paradigm completion, also known as "the paradigm cell filling problem" Ackerman et al. (2009).

Here, participants were given a few complete inflectional paradigms as training data. At test time, partially filled paradigms, i.e. paradigms with significant gaps in them, were to be completed by filling out the missing cells. Table 2 gives examples.

Thus, sub-task 2 requires predicting many inflections of the same lemma. Recall that sub-task 1 also required the system to predict several inflections of the same lemma (when they appear as separate examples in test data). However, in sub-task 2, one of our test-time evaluation metrics (§ 2.3) is full-paradigm accuracy. Also, the sub-task 2 training data provides full paradigms, in contrast to sub-task 1 where it included only a few inflected forms per lemma. Finally, at test time, sub-task 2 presents each lemma along with some of its inflected forms, which is potentially helpful if the lemma had not appeared previously in training data.

Apart from the theoretical interest in this problem Ackerman and Malouf (2013), this sub-task is grounded in the practical problem of extrapolation of basic resources for a language, where only a few complete paradigms may be available from a native speaker informant Sylak-Glassman et al. (2016) or a reference grammar. L2 classroom instruction also asks human students to memorize example paradigms and generalize from them.

Key points:

1. The training data consisted of complete paradigms.

2. Not all paradigms within a language have the same shape. A noun lemma will have a different set of cells than a verb lemma does, and verbs of different classes (e.g., lexically perfective vs. imperfective) may also have different sets of cells.

3. The task was paradigm completion: given a sparsely populated paradigm, participants should generate the inflected forms (strings) for all missing cells.

4. The task simulates learning from compiled grammatical resources and inflection tables, or learning from a limited time with a native-language informant in a fieldwork scenario.

5. Three training sets were given, building up in size from only a few complete paradigms to a large number (dozens).

2.3 Evaluation

Each team participating in a given sub-task was asked to submit $156$ versions of their system, where each version was trained using a different training set ($3$ training sizes $\times$ $52$ languages) and its corresponding development set. We evaluated each submitted system on its corresponding test set, i.e., the test set for its language.

We computed three evaluation metrics: (i) Overall 1-best test-set accuracy, i.e., is the predicted paradigm cell correct? (ii) average Levenshtein distance, i.e., how badly does the predicted form disagree with the answer? (iii) Full-paradigm accuracy, i.e., is the complete paradigm correct? This final metric only truly makes sense in sub-task 2, where full paradigms are given for evaluation. For each sub-task, the three data conditions (low, medium, and high) resulted in a learning curve. For each system in each condition, we report the average metrics across all 52 languages.
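The three metrics can be made concrete with a short sketch. It is illustrative only and is not the official evaluation script: the function names are hypothetical, Levenshtein distance is computed with unit costs, and a paradigm is assumed to be identified by its lemma.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def evaluate(examples):
    """examples: iterable of (lemma, predicted_form, gold_form).

    Returns (per-form accuracy, mean Levenshtein distance,
    full-paradigm accuracy). Assumption: one paradigm per lemma.
    """
    examples = list(examples)
    n = len(examples)
    correct = sum(pred == gold for _, pred, gold in examples)
    distance = sum(levenshtein(pred, gold) for _, pred, gold in examples)
    by_lemma = defaultdict(list)
    for lemma, pred, gold in examples:
        by_lemma[lemma].append(pred == gold)
    # A paradigm counts as correct only if every one of its cells is correct.
    full = sum(all(cells) for cells in by_lemma.values())
    return correct / n, distance / n, full / len(by_lemma)
```

Full-paradigm accuracy is the strictest of the three, since a single wrong cell makes the whole paradigm count as incorrect.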

3 Data

3.1 Languages

The data for the shared task was highly multilingual, comprising 52 unique languages. Data for 47 of the languages came from the English edition of Wiktionary, a large multi-lingual crowd-sourced dictionary containing morphological paradigms for many lemmata (https://en.wiktionary.org/, 08-2016 snapshot). Data for Khaling, Kurmanji Kurdish, and Sorani Kurdish was created as part of the Alexina project Walther et al. (2013, 2010); Walther and Sagot (2010) (https://gforge.inria.fr/projects/alexina/). Novel data for Haida, a severely endangered North American language isolate, was prepared by Jordan Lachler (University of Alberta). The Basque language data was extracted from a manually designed finite-state morphological analyzer Alegria et al. (2009).

The shared task language set is genealogically diverse, including languages from 10 language stocks. Although the majority of the languages are Indo-European, we also include two language isolates (Haida and Basque) along with languages from Athabaskan (Navajo), Kartvelian (Georgian), Quechua, Semitic (Arabic, Hebrew), Sino-Tibetan (Khaling), Turkic (Turkish), and Uralic (Estonian, Finnish, Hungarian, and Northern Sami). The shared task language set is also diverse in terms of morphological structure, with languages which use primarily prefixes (Navajo), suffixes (Quechua and Turkish), and a mix, with Spanish exhibiting internal vowel variations along with suffixes and Georgian using both infixes and suffixes. The language set also exhibits features such as templatic morphology (Arabic, Hebrew), vowel harmony (Turkish, Finnish, Hungarian), and consonant harmony (Navajo) which require systems to learn non-local alternations. Finally, the resource level of the languages in the shared task set varies greatly, from major world languages (e.g. Arabic, English, French, Spanish, Russian) to languages with few speakers (e.g. Haida, Khaling).

3.2 Data Format

For each language, the basic data consists of triples of the form (lemma, feature bundle, inflected form), as in Table 1. The first feature in the bundle always specifies the core part of speech (e.g., verb). All features in the bundle are coded according to the UniMorph Schema, a cross-linguistically consistent universal morphological feature set Sylak-Glassman et al. (2015a, b).
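A minimal sketch of how such a triple might be represented in code. The tab-separated column order and the names `Triple` and `parse_triple` are assumptions made for illustration, not a description of the released files; the semicolon-separated bundle follows the notation the paper uses for bundles such as V;V.PTCP;PST.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Triple:
    lemma: str
    features: Tuple[str, ...]  # e.g. ("V", "V.PTCP", "PST")
    form: str

    @property
    def pos(self) -> str:
        # The first feature in the bundle specifies the core part of speech.
        return self.features[0]


def parse_triple(line: str) -> Triple:
    # Assumed layout: lemma <TAB> feature bundle <TAB> inflected form.
    lemma, bundle, form = line.rstrip("\n").split("\t")
    return Triple(lemma, tuple(bundle.split(";")), form)
```

For example, `parse_triple("schielen\tV;V.PTCP;PST\tgeschielt")` yields a triple whose `pos` is `"V"`.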

3.3 Extraction from Wiktionary

For each of the 47 Wiktionary languages, Wiktionary provides a number of tables, each of which specifies the full inflectional paradigm for a particular lemma. These tables were initially extracted via a multi-dimensional table parsing strategy Kirov et al. (2016); Sylak-Glassman et al. (2015a).

As noted in § 2.2, different paradigms may have different shapes. To prepare the shared task data, each language's parsed tables from Wiktionary were grouped according to their tabular structure and number of cells. Each group represents a different type of paradigm (e.g., verb). We used only groups with a large number of lemmata, relative to the number of lemmata available for the language as a whole. For each group, we associated a feature bundle with each cell position in the table, by manually replacing the prose labels describing grammatical features (e.g. "accusative case") with UniMorph features (e.g. acc). This allowed us to extract triples as described in the previous section.

By applying this process across the 47 languages, we constructed a large multilingual dataset that refines the parsed tables from previous work. This dataset was sampled to create appropriately-sized data for the shared task, as described in § 3.4. [Footnote 7: Full, unsampled Wiktionary parses are made available at unimorph.org on a rolling basis.] Full and sampled dataset sizes by language are given in Table 3.

Systematic syncretism is collapsed in Wiktionary. For example, in English, feature bundles do not distinguish between different person/number forms of past tense verbs, because they are identical. [Footnote 8: In this example, Wiktionary omits the single exception: the lemma be distinguishes between past tenses was and were.] Thus, the past-tense form went appears only once in the table for go, not six times, and gives rise to only one triple, whose feature bundle specifies past tense but not person and number.

3.4 Sampling the Train-Dev-Test Splits

From each language's collection of paradigms, we sampled the training, development, and test sets as follows. These datasets can be obtained from http://www.sigmorphon.org/conll2017.

Our first step was to construct probability distributions over the (lemma, feature bundle, inflected form) triples in our full dataset. For each triple, we counted how many tokens the inflected form has in the February 2017 dump of Wikipedia for that language. Note that this simple "string match" heuristic overestimates the count, since strings are ambiguous: not all of the counted tokens actually render that feature bundle. [Footnote 9: For example, in English, any token of the string walked will be double-counted as both the past tense and the past participle of the lemma walk. This problem holds for all regular English verbs. Similarly, when we are counting the present-tense tokens lay of the lemma lay, we will also include tokens of the string lay that are actually the past tense of lie, or are actually the adjective or noun senses of lay. The alternative to double-counting each ambiguous token would have been to use EM to split the token's count of 1 unequally among its possible analyses, in proportion to their estimated prior probabilities Cotterell et al. (2015).]

From these counts, we estimated a unigram distribution over triples, using Laplace smoothing (add-1 smoothing). We then sampled 12000 triples without replacement from this distribution. The first 100 were taken as the low-resource training set for sub-task 1, the first 1000 as the medium-resource training set, and the first 10000 as the high-resource training set. Note that these training sets are nested, and that the highest-count triples tend to appear in the smaller training sets.

The final 2000 triples were randomly shuffled and then split in half to obtain development and test sets of 1000 forms each. The final shuffling was performed to ensure that the development set is similar to the test set. By contrast, the development and test sets tend to contain lower-count triples than the training set. [Footnote 10: This is a realistic setting, since supervised training is usually employed to generalize from frequent words that appear in annotated resources to less frequent words that do not. Unsupervised learning methods also tend to generalize from more frequent words (which can be analyzed more easily by combining information from many contexts) to less frequent ones.] In those languages where we have fewer than 12000 total forms, we omit the high-resource training set (all languages have at least 3000 forms).
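The sub-task 1 sampling procedure can be sketched as follows. This is a hedged illustration, not the organizers' script: the function names are hypothetical, and weighted sampling without replacement is implemented with the "exponential race" trick (each item gets an exponential arrival time whose rate is its add-1 smoothed count, and items are taken in order of arrival), which is one standard way to draw items one at a time in proportion to their weights.

```python
import random


def sample_order(counts, rng):
    """Order all triples by weighted sampling without replacement.

    counts: dict mapping triple -> corpus token count.
    Weights are add-1 (Laplace) smoothed counts.
    """
    keyed = [(rng.expovariate(count + 1), triple)
             for triple, count in counts.items()]
    keyed.sort()  # earliest arrival first: high-count triples tend to lead
    return [triple for _, triple in keyed]


def split_subtask1(counts, rng, train_sizes=(100, 1000, 10000), heldout=2000):
    # Omit any training size for which the language has too few forms.
    # (Assumes at least the smallest size fits.)
    sizes = [n for n in train_sizes if n + heldout <= len(counts)]
    drawn = sample_order(counts, rng)[: max(sizes) + heldout]
    train = {n: drawn[:n] for n in sizes}  # nested prefixes of the sample
    rest = drawn[-heldout:]                # the final `heldout` triples
    rng.shuffle(rest)                      # make dev and test comparable
    half = heldout // 2
    return train, rest[:half], rest[half:]
```

Because the training sets are prefixes of a single ordered sample, they are nested by construction, and the held-out triples come from the lower-count tail of that sample.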

To sample the data for sub-task 2, we perform a similar procedure. For each paradigm in our full dataset, we counted the number of tokens in Wikipedia that matched any of the inflected forms in the paradigm. From these counts, we estimated a unigram distribution over paradigms, using Laplace smoothing. We sampled 300 paradigms without replacement from this distribution. The low-resource training set contains the first 10 paradigms, the medium-resource training set contains the first 50, and the high-resource training set contains the first 200. Again, these training sets are nested. Note that since different languages have paradigms of different sizes, the actual number of training exemplars may differ drastically.

With the same motivation as before, we shuffled the remaining 100 paradigms and took the first 50 as development and the next 50 as test. (In those languages with fewer than 300 paradigms, we again omitted the high-resource training setting.) For each development or test paradigm, we chose about $\frac{1}{5}$ of the slots to provide to the system as input along with the lemma, asking the system to predict the remaining $\frac{4}{5}$. We determined which cells to keep by independently flipping a biased coin with probability $0.2$ for each cell.
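The cell-selection step amounts to an independent Bernoulli draw per cell. A minimal sketch, assuming a paradigm is stored as a mapping from feature bundle to inflected form (the name `mask_paradigm` and this representation are illustrative choices):

```python
import random


def mask_paradigm(paradigm, rng, keep_prob=0.2):
    """Split a paradigm into cells given as input and cells to predict.

    paradigm: dict mapping feature bundle -> inflected form.
    """
    given, hidden = {}, {}
    for bundle, form in paradigm.items():
        # Independent biased coin flip for each cell.
        if rng.random() < keep_prob:
            given[bundle] = form
        else:
            hidden[bundle] = form
    return given, hidden
```

Since the flips are independent, the number of given cells varies from paradigm to paradigm and is only about one fifth on average; a small paradigm can end up with no given cells at all.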

Because of the count overestimates mentioned above, our sub-task 1 dataset overrepresents triples where the inflected form (the answer) is ambiguous, and our sub-task 2 dataset overrepresents paradigms that contain ambiguous inflected forms. The degree of ambiguity varied among languages: the average number of triples per inflected form string ranged from 1.00 in Sorani to 2.89 in Khaling, with an average of 1.43 across all languages. Despite this distortion of true unigram counts, we believe that our datasets captured a sufficiently broad sample of the feature combinations for every language.

4 Previous Work

Most recent work in inflection generation has focused on sub-task 1, i.e., generating inflected forms from the lemma. Numerous, methodologically diverse approaches have been published. We highlight a representative sample of recent work. Durrett and DeNero (2013) heuristically extracted transformation rules and trained a semi-Markov model Sarawagi and Cohen (2004) to learn when to apply them to the input. Nicolai et al. (2015) trained a discriminative string-to-string monotonic transduction tool, DirecTL+ Jiampojamarn et al. (2008), to generate inflections. Ahlberg et al. (2014) reduced the problem to multi-class classification, where they used finite-state techniques to first generalize inflectional patterns and then trained a feature-rich classifier to choose the optimal such pattern to inflect unseen words Ahlberg et al. (2015). Finally, Malouf (2016), Faruqui et al. (2016) and Kann and Schütze (2016) proposed neural sequence-to-sequence models Sutskever et al. (2014), with Kann and Schütze making use of an attention mechanism Bahdanau et al. (2015). Overall, the neural approaches have generally been found to be the most successful.

Some work has also focused on scenarios similar to sub-task 2. For example, Dreyer and Eisner (2009) modeled the distribution over the paradigms of a language as a Markov Random Field (MRF), where each cell is represented as a string-valued random variable. The MRF's factors are specified as weighted finite-state machines of the form given by Dreyer et al. (2008). Building upon this, Cotterell et al. (2015) proposed using a Bayesian network where both lemmata (repeated within a paradigm) and affixes (repeated across paradigms) were encoded as string-valued random variables. That work required its finite-state transducers to take a more restricted form Cotterell et al. (2014) for computational reasons. Finally, Kann et al. (2017a) proposed a multi-source sequence-to-sequence network, allowing a neural transducer to exploit multiple source forms simultaneously.

SIGMORPHON 2016 Shared Task.

Last year, the SIGMORPHON 2016 shared task (http://sigmorphon.org/sharedtask) focused on 10 languages (including 2 surprise languages). As for the present 2017 task, most of the 2016 data was derived from Wiktionary. The 2016 shared task had submissions from 9 competing teams with members from 11 universities. As mentioned in § 2.1, our sub-task 1 is an extension of sub-task 1 from 2016. The other sub-tasks in 2016 focused on the more general reinflection problem, where systems had to learn to map from any inflected form to any other with varying degrees of annotations. See Cotterell et al. (2016a) for details.

5 The Baseline System

The shared task provided a baseline system to participants that addressed both tasks and all languages. The system was designed for speed of application and also for adequate accuracy with little training data, in particular in the low and medium data conditions. The design of the baseline was inspired by the University of Colorado's submission Liu and Mao (2016) to the SIGMORPHON 2016 shared task.

5.1 Alignment

For each (lemma, feature bundle, inflected form) triple in training data, the system initially aligns the lemma with the inflected form by finding the minimum-cost edit path. Costs are computed with a weighted scheme such that substitutions have a slightly higher cost (1.1) than insertions or deletions (1.0). For example, the German training data pair schielen-geschielt 'to squint' (going from the lemma to the past participle) is aligned as:

--schielen
geschielt-
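The alignment step can be sketched as a standard dynamic program over edit operations. This is an illustration under stated assumptions, not the baseline's actual code: the function name `align` is hypothetical, costs are stored in tenths as integers to avoid floating-point ties, and the tie-breaking order in the traceback (deletion, then match or substitution, then insertion) is an assumption chosen because it reproduces the example alignment above.

```python
SUB, INDEL = 11, 10  # costs in tenths: substitution 1.1, insertion/deletion 1.0


def align(lemma: str, form: str, blank: str = "-"):
    """Return (aligned lemma, aligned form, cost) for a minimum-cost edit path."""
    n, m = len(lemma), len(form)

    def sub_cost(a: str, b: str) -> int:
        return 0 if a == b else SUB

    # cost[i][j]: cheapest way to turn lemma[:i] into form[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * INDEL
    for j in range(1, m + 1):
        cost[0][j] = j * INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(lemma[i - 1], form[j - 1]),
                cost[i - 1][j] + INDEL,   # delete lemma[i-1]
                cost[i][j - 1] + INDEL,   # insert form[j-1]
            )

    # Trace back from the end, recording one aligned column per step.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and cost[i][j] == cost[i - 1][j] + INDEL:
            top.append(lemma[i - 1]); bottom.append(blank); i -= 1
        elif (i > 0 and j > 0 and cost[i][j]
              == cost[i - 1][j - 1] + sub_cost(lemma[i - 1], form[j - 1])):
            top.append(lemma[i - 1]); bottom.append(form[j - 1]); i -= 1; j -= 1
        else:
            top.append(blank); bottom.append(form[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), cost[n][m] / 10
```

Pricing a substitution above a single insertion or deletion, but below the two combined, means the aligner substitutes only when the alternative is a deletion plus an insertion, as with e/t in the example.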

The system now assumes that each aligned pair can be broken up into a prefix, stem and a suffix, based on where the inputs or outputs have initial or trailing blanks after alignment. We assume that initial or trailing blanks in either input or output reflect boundaries between a prefix and a stem, or a stem and a suffix. This allows us to divide each training example into three parts. Using the example above, the pairs would be aligned as follows, after padding the edges with $-symbols:

[TABLE]

5.2 Inflection Rules

From this alignment, the system extracts a prefix-changing rule based on the prefix pairing, as well as a set of suffix-changing rules based on suffixes of the stem+suffix pairing. The example alignment above yields the eight extracted suffix-modifying rules

[TABLE]

as well as the prefix-modifying rule $\rightarrow$ ge (i.e., insert the prefix ge).

Since these rules were obtained from the triple (schielen, V;V.PTCP;PST, geschielt), they are associated with a token of the feature bundle V;V.PTCP;PST.

5.3 Generation

At test time, to inflect a lemma with features, the baseline system applies rules associated with training tokens of the precise feature bundle. There is no generalization across bundles that share features.

Specifically, the longest-matching suffix rule associated with the feature bundle is consulted and applied to the input form. Ties are broken by frequency, in favor of the rule that has occurred most often with this feature bundle. After this, the prefix rule that occurred most often with the bundle is likewise applied. That is, the prefix-matching rule has no longest-match preference, while the suffix-matching rule does.

For example, to inflect kaufen 'to buy' with the features V;V.PTCP;PST, using the single example above as training data, we would find that the longest matching stored suffix-rule is en $\rightarrow$ t, which would transform kaufen into an intermediate form kauft, after which the most frequent prefix-rule, $\rightarrow$ ge, would produce the final output gekauft. If no rules have been associated with a particular feature bundle (as often happens in the low data condition), the inflected form is simply taken to be a copy of the lemma.
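Rule extraction and generation can be sketched together. This is a simplified reading of the description above rather than the released baseline: the class and method names are hypothetical, word edges are marked with "$" as in the padding step, the input is an already aligned pair with "-" as the blank, and a prefix rule is applied only if its left-hand side matches the word (an assumption).

```python
from collections import Counter, defaultdict


class RuleBaseline:
    def __init__(self):
        self.suffix = defaultdict(Counter)  # bundle -> {(old, new): count}
        self.prefix = defaultdict(Counter)

    def train(self, top, bottom, bundle, blank="-"):
        """top/bottom: aligned lemma and inflected form of equal length."""
        n = len(top)
        start = 0  # leading blanks mark the prefix region
        while start < n and blank in (top[start], bottom[start]):
            start += 1
        end = n    # trailing blanks mark the suffix region
        while end > start and blank in (top[end - 1], bottom[end - 1]):
            end -= 1

        def clean(s):
            return s.replace(blank, "")

        self.prefix[bundle][("$" + clean(top[:start]),
                             "$" + clean(bottom[:start]))] += 1
        # One suffix rule per suffix of the stem+suffix pairing.
        for k in range(end - start + 1):
            rule = (clean(top[end - k:]) + "$", clean(bottom[end - k:]) + "$")
            self.suffix[bundle][rule] += 1

    def inflect(self, lemma, bundle):
        if bundle not in self.suffix:
            return lemma  # no rules for this bundle: copy the lemma
        word = "$" + lemma + "$"
        # Longest matching suffix rule; ties broken by frequency.
        hits = [(len(old), cnt, old, new)
                for (old, new), cnt in self.suffix[bundle].items()
                if word.endswith(old)]
        if hits:
            _, _, old, new = max(hits)
            word = word[: len(word) - len(old)] + new
        # Most frequent prefix rule, with no longest-match preference.
        hits = [(cnt, old, new)
                for (old, new), cnt in self.prefix[bundle].items()
                if word.startswith(old)]
        if hits:
            _, old, new = max(hits)
            word = new + word[len(old):]
        return word.strip("$")
```

Trained on the single aligned pair for schielen, this sketch stores eight suffix rules and one prefix rule for V;V.PTCP;PST, and inflecting kaufen with that bundle yields gekauft.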

In sub-task 2, paradigm completion, the baseline system simply repeats the sub-task 1 method and generates all the missing forms independently from the lemma. It does not take advantage of the other forms that are presented in the partially filled paradigm.

In addition to the above, the baseline system uses a heuristic to place a language into one of two categories: largely prefixing or largely suffixing. Some languages, such as Navajo, are largely prefixing and have more complex changes in the left periphery of the input rather than at the right. However, in the method described above, the operation of the prefix rules is more restricted than that of the suffix rules: prefix rules tend to perform no change at all, or insert or delete a prefix. For largely prefixing languages, the method performs better when operating with reversed strings. Classifying a language into prefixing or suffixing is done by simply counting how often there is a prefix change vs. suffix change in going from the lemma form to the inflected form in the training data. Whenever a language is found to be largely prefixing, the system works with reversed strings throughout to allow more expressive changes in the left edge of the input.
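The prefixing-versus-suffixing heuristic and the string-reversal trick can be sketched as below. The function names are hypothetical, and "change" is approximated by the presence of a blank at the left or right edge of an aligned pair, which is an assumption about how the counts are taken.

```python
def is_prefixing(aligned_pairs, blank="-"):
    """aligned_pairs: iterable of non-empty (aligned lemma, aligned form) pairs."""
    prefix_changes = suffix_changes = 0
    for top, bottom in aligned_pairs:
        prefix_changes += blank in (top[0], bottom[0])    # change at left edge
        suffix_changes += blank in (top[-1], bottom[-1])  # change at right edge
    return prefix_changes > suffix_changes


def inflect_reversed(inflect, lemma, bundle):
    """Run a suffix-oriented inflector on the reversed string.

    `inflect` is any function (lemma, bundle) -> form; for this to be
    meaningful it must also have been trained on reversed strings.
    """
    return inflect(lemma[::-1], bundle)[::-1]
```

Reversing the strings lets the more expressive longest-match suffix rules operate on what is, in the original string, the left edge of the word.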

6 System Descriptions

The CoNLL-SIGMORPHON 2017 shared task received submissions from 11 teams with members from 15 universities and institutes (Table 5). Many of the teams submitted more than one system, yielding a total of 25 unique systems entered including the baseline system.

In contrast to the 2016 shared task, all but one of the submitted systems included a neural component. Despite the relative uniformity of the submitted architectures, we still observed large differences in the individual performances. Rather than differences in architecture, a major difference this year was the various methods for supplying the neural network with auxiliary training data. For ease of presentation, we break the systems down by their features (see Table 6) and discuss the systems that had each feature. In all cases, further details of the methods can be found in the system description papers, which are cited in Table 5.

Neural Parameterization.

All systems except for that of the EHU team employed some form of neural network. Moreover, all teams except for SU-RUG, which employed a convolutional neural network, made use of some form of gated recurrent network: either a gated recurrent unit (GRU) Chung et al. (2014) or long short-term memory (LSTM) Hochreiter and Schmidhuber (1997). In these neural models, a common strategy was to feed the morphological tag of the form to be predicted into the network along with the input, where each subtag was its own symbol.
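A minimal sketch of this input encoding, assuming the bundle is a semicolon-separated string. Placing the subtags before the characters and wrapping them in angle brackets are illustrative choices, not something the paper specifies, and individual systems may have differed.

```python
def encode_input(lemma: str, bundle: str):
    """Build the encoder's symbol sequence: one symbol per subtag,
    followed by one symbol per character of the lemma."""
    tag_symbols = ["<" + tag + ">" for tag in bundle.split(";")]
    return tag_symbols + list(lemma)
```

For kaufen with V;V.PTCP;PST this gives `["<V>", "<V.PTCP>", "<PST>", "k", "a", "u", "f", "e", "n"]`. Treating each subtag as its own symbol lets bundles that share a feature such as PST share the corresponding embedding.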

Hard Alignment versus Soft Attention.

Another axis, along which the systems differ is the use of hard alignment, over soft attention. The neural attention mechanism was introduced in Bahdanau et al. (2015) for neural machine translation (NMT). In short, these mechanisms avoid the necessity of encoding the input word into a fixed length vector, by allowing the decoder to attend to different parts of the inputs. Just as in NMT, the attention mechanism has led to large gains in morphological inflection. The CMU, CU, IIT (BHU), LMU, UE-LMU, UF and UTNII systems all employed such mechanisms.

An alternative to soft attention is hard, monotonic alignment, i.e., a neural parameterization of a traditional finite-state transduction system. These systems enforce a monotonic alignment between source and target forms. In the 2016 shared task (see Cotterell et al., 2016a, Table 6) such a system placed second Aharoni et al. (2016), and this year's winning system—CLUZH—was an extension of that one. (See, also, Aharoni and Goldberg (2017) for a further explication of the technique and Rastogi et al. (2016) for discussion of a related neural parameterization of a weighted finite-state machine.) Their system allows for explicit biasing towards a copy action that appears useful in the low-resource setting. Despite its neural parameterization, the CLUZH system is most closely related to the systems of UA and EHU, which train weighted finite-state transducers, albeit with a log-linear parameterization.
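To make the idea of a monotonic alignment with a copy action concrete, the following sketch applies a sequence of edit actions to a source string. It is a simplified illustration, not the CLUZH implementation: the action inventory (`STEP`, `COPY`, or a literal character to write) and the function name `apply_actions` are assumptions. In a neural system of this kind, the network would predict the action sequence; here it is written by hand.

```python
STEP = "<step>"  # hypothetical action: advance the source pointer by one
COPY = "<copy>"  # hypothetical action: emit the source character under the pointer


def apply_actions(source: str, actions: list[str]) -> str:
    """Execute edit actions left to right over the source string.

    The pointer only ever moves forward, which is what makes the
    alignment between source and target monotonic.
    """
    pointer = 0
    output = []
    for action in actions:
        if action == STEP:
            pointer += 1
        elif action == COPY:
            if pointer >= len(source):
                raise ValueError("copy action past the end of the source")
            output.append(source[pointer])
        else:
            # Any other symbol is written to the output as is.
            output.append(action)
    return "".join(output)


# Copy "walk" character by character, then write the suffix "ed".
actions = [COPY, STEP, COPY, STEP, COPY, STEP, COPY, "e", "d"]
print(apply_actions("walk", actions))
```

Most of an inflected form is typically shared with its lemma, so a model that assigns high probability to `COPY` by default already gets much of the output right, which is one plausible reading of why such a bias helps when training data is scarce.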

Reranking.

Reranking the output of a weaker system was a tack taken by two systems: ISI and UA. The ISI system started with a heuristically induced candidate set, using the edit tree approach described by Chrupała et al. (2008), and then chose the best edit tree. This approach is effectively a neuralized version of the lemmatizer proposed in Müller et al. (2015) and, indeed, was originally intended for that task Chakrabarty et al. (2017). The UA team, following their 2016 submission, proposed a linear reranking on top of the $k$-best output of their transduction system.

Data Augmentation.

Many teams made use of auxiliary training data—unlabeled or synthetic forms. Some teams leveraged the provided Wikipedia corpora (see § 3). The UE-LMU team used these unlabeled corpora to bias their methods towards copying by transducing an unlabeled word to itself. The same team also explored a similar setup that instead learned to transduce random strings to themselves, and found that using random strings worked almost as well as words that appeared in unlabeled corpora. CMU used a variational autoencoder and treated the tags of unannotated words in the Wikipedia corpus as latent variables (see Zhou and Neubig (2017b) for more details). Other teams attempted to get silver-standard labels for the unlabeled corpora. For example, the UA team trained a tagger on the given training examples, and then tagged the corpus with the goal to obtain additional instances, while the UE-LMU team used a series of unsupervised heuristics. The CU team—which did not make use of external resources—hallucinated more training data by identifying suffix and prefix changes in the given training pairs and then using that information to create new artificial training pairs. The LMU submission also experimented with hand-written rules to artificially generate more data. It seems likely that the primary difference in the performance of the various neural systems lay in these strategies for the creation of new data to train the parameters, rather than in the neural architectures themselves.
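The copy-biasing idea described above (transducing an unlabeled word, or a random string, to itself) can be sketched as follows. This is an illustration under stated assumptions rather than the UE-LMU implementation: the function names, the `COPY` pseudo-tag, the string length range, and the (source, tag, target) triple layout are all hypothetical.

```python
import random


def copy_examples(words, copy_tag="COPY"):
    """Build auxiliary training triples that map each word to itself."""
    return [(w, copy_tag, w) for w in words]


def random_string_examples(alphabet, n, min_len=3, max_len=8, seed=0, copy_tag="COPY"):
    """Build n auxiliary triples that map a random string to itself.

    Characters are drawn uniformly from the given alphabet, so the
    strings need not resemble real words of the language.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        s = "".join(rng.choice(alphabet) for _ in range(length))
        examples.append((s, copy_tag, s))
    return examples


# Words that could come from an unlabeled corpus (toy data).
corpus_copies = copy_examples(["katze", "hund"])
# Random strings over the characters observed in the training data (toy alphabet).
random_copies = random_string_examples("abcdefghij", n=5)
print(corpus_copies + random_copies)
```

The auxiliary triples would be mixed with the labeled training set; the finding that random strings worked almost as well as corpus words suggests that what the model gains is the copying behavior itself rather than knowledge of real vocabulary.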

7 Performance of the Systems

Relative system performance is described in Tables 7 and 8, which show the average per-language accuracy of each system by resource condition, for each of the sub-tasks. The tables reflect the fact that some teams submitted more than one system (e.g., LMU-1 and LMU-2). Learning curves for each language across conditions are shown in Table 9, which indicates the best per-form accuracy achieved by a submitted system. Full results, including full-paradigm accuracy, can be found in the appendix.

Three teams exploited external resources in some form: UA, CMU, and UE-LMU. In general, any relative performance gained was minimal. The CMU system was outranked by several systems that avoided external resource use in the High and Medium conditions in which it competed. UE-LMU only submitted a system that used additional resources in the Medium condition, and saw gains of $\sim$1% compared to their basic system, while it was still outranked overall by CLUZH. In the Low condition, UA saw gains of $\sim$3% using external data. However, all UA submissions were limited to a small handful of languages.

All but one of the systems submitted were neural. As expected given the results from SIGMORPHON 2016, these systems perform very well when in the High training condition where data is relatively plentiful. In the Low and Medium conditions, however, standard encoder-decoder architectures perform worse than the baseline using only the training data provided. Teams that beat the baseline succeeded by biasing networks towards the correct solutions through pre-training on synthetic data designed to capture the overall inflectional patterns in a language. As seen in Table 9, these techniques worked better for some languages than for others. Languages with smaller, more regular paradigms were handled well (e.g., English sub-task 1 low-resource accuracy was at 90%). Languages with more complex systems, like Latin, proved more challenging (the best system achieved only 19% accuracy in the low condition). For these languages, it is possible that the relevant variation required to learn a best per-form inflectional pattern was simply not present in the limited training data, and that a language-specific learning bias was required.

Even though the top-ranked systems do well on their own, different systems may contain some amount of complementary information, so that an ensemble over multiple approaches has a chance to improve accuracy. We present an upper bound on the possible performance of such an ensemble. Table 7 and Table 8 include an ``Ensemble Oracle'' system (oracle-e) that gives the correct answer if *any* of the submitted systems is correct. The oracle performs significantly better than any one system in both the Medium ($\sim$10%) and Low ($\sim$15%) conditions. This suggests that the different strategies used by teams to ``bias'' their systems in an effort to make up for sparse data lead to substantially different generalization patterns.
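The ensemble oracle defined above is simple to compute from per-system predictions. The sketch below assumes each system's predictions are stored as a list aligned with the gold forms; the function names, system names, and toy data are hypothetical and are not taken from the shared task results.

```python
def accuracy(predictions, gold):
    """Exact-match accuracy of one system against the gold forms."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def ensemble_oracle_accuracy(system_predictions, gold):
    """Fraction of test items that at least one system predicts correctly.

    system_predictions maps a system name to its list of predicted forms,
    in the same order as gold.
    """
    correct = 0
    for i, g in enumerate(gold):
        if any(preds[i] == g for preds in system_predictions.values()):
            correct += 1
    return correct / len(gold)


# Toy data: two made-up systems that are each right on a different item.
gold = ["walked", "ran", "sang"]
systems = {
    "sys-a": ["walked", "runned", "singed"],
    "sys-b": ["walkt", "ran", "singed"],
}
print({name: accuracy(preds, gold) for name, preds in systems.items()})
print(ensemble_oracle_accuracy(systems, gold))
```

In this toy case each system is correct on one of three items while the oracle is correct on two, mirroring the observation that a large oracle gain indicates systems making different errors.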

For sub-task 1, we also present a second ``Feature Combination'' Oracle (oracle-fc) that gives the correct answer for a given test triple iff its feature bundle appeared in training (with any lemma). Thus, oracle-fc provides an upper bound on the performance of systems that treat a feature bundle such as V;SBJV;FUT;3;PL as atomic. In the low-data condition, this upper bound was only 71%, meaning that 29% of the test bundles had never been seen in training data. Nonetheless, systems should be able to make some accurate predictions on this 29% by decomposing each test bundle into individual morphological features such as FUT (future) and PL (plural), and generalizing from training examples that involve those features. For example, a particular feature or sub-bundle might be realized as a particular affix. Several of the systems treated each individual feature as a separate input to the recurrent network, in order to enable this type of generalization. In the medium data condition for some languages, these systems sometimes far surpassed oracle-fc. The most notable example of this is Basque, where oracle-fc produced a 47% accuracy while six of the submitted systems produced an accuracy of 85% or above. Basque is an extreme example with very large paradigms for the verbs that inflect in the language (only a few dozen common ones do). This result demonstrates the ability of the neural systems to generalize and correctly inflect according to unseen feature combinations.
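The oracle-fc bound, and the reason systems can exceed it, can be sketched in a few lines. The code assumes training and test items are (lemma, feature bundle, form) triples with `;`-separated features; the function names and the toy Spanish examples are illustrative assumptions, not shared task data.

```python
def oracle_fc_accuracy(train, test):
    """Share of test triples whose full feature bundle occurred in training.

    This is the best a system could do if it treated each bundle as atomic.
    """
    seen_bundles = {bundle for _, bundle, _ in train}
    return sum(bundle in seen_bundles for _, bundle, _ in test) / len(test)


def unseen_features(train, bundle):
    """Individual features of a bundle that never occurred in training."""
    seen = {f for _, b, _ in train for f in b.split(";")}
    return set(bundle.split(";")) - seen


# Toy (lemma, bundle, form) triples for illustration only.
train = [
    ("hablar", "V;IND;PRS;3;SG", "habla"),
    ("comer", "V;IND;FUT;3;PL", "comerán"),
    ("vivir", "V;SBJV;PRS;3;PL", "vivan"),
]
test = [
    ("cantar", "V;IND;PRS;3;SG", "canta"),
    ("beber", "V;SBJV;FUT;3;PL", "bebieren"),
]
print(oracle_fc_accuracy(train, test))
print(unseen_features(train, "V;SBJV;FUT;3;PL"))
```

Here the bundle V;SBJV;FUT;3;PL is unseen as a whole, so oracle-fc counts it as an error, yet every one of its features appears somewhere in training. A system that feeds each feature as a separate input therefore has evidence to generalize from.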

8 Future Directions

As regards morphological inflection, there is a plethora of future directions to consider. First, one might consider morphological transductions over pronunciations, rather than spellings. This is more challenging in the many languages (including English) where the orthography does not reflect the phonological changes that accompany morphological processes such as affixation. Orthography usually also does not reflect predictable allophonic distinctions in pronunciation Sampson (1985), which one might attempt to predict, such as the difference in aspiration of /t/ in English [tʰɑp] (top) vs. [stɑp] (stop).

A second future direction involves the effective incorporation of external unannotated monolingual corpora into the state-of-the-art inflection or reinflection systems. The best systems in our competition did not make use of external data and those that did make heavy use of such data, e.g., the CMU team, did not see much gain. The best way to use external corpora remains an open question; we surmise that they can be useful, especially in the lower-resource cases. A related line of inquiry is the incorporation of cross-lingual information, which Kann et al. (2017b) did find to be helpful.

A third direction revolves around the efficient elicitation of morphological information (i.e., active learning). In the low-resource section, we asked our participants to find the best approach to generate new forms given existing morphological annotation. However, it remains an open question which of the cells in a paradigm are best to collect annotation for in the first place. Likely, it is better to collect diagnostic forms that are closer to principal parts of the paradigm Finkel and Stump (2007); Ackerman et al. (2009); Montermini and Bonami (2013); Cotterell et al. (2017), as these will contain enough information such that the remaining transformations are largely deterministic. Experimental studies, however, suggest that speakers also rely strongly on pattern frequencies for inferring unknown forms Seyfarth et al. (2014). Another interesting direction would therefore be to organize data according to plausible real frequency distributions (especially in spoken data) and to explore possibly varying learning strategies associated with lexical items of various frequencies.

Finally, there is a wide variety of other tasks involving morphology. While some of these have had a shared task, e.g., the parsing of morphologically-rich languages Tsarfaty et al. (2010) and unsupervised morphological segmentation Kurimo et al. (2010), many have not, e.g., supervised morphological segmentation and morphological tagging. A key purpose of shared tasks in the NLP community is the preparation and release of standardized data sets for fair comparison among methods. Future shared tasks in other areas of computational morphology would seem in order, given the overall effectiveness of shared tasks in unifying research objectives in subfields of NLP, and as a starting point for possible cross-over with cognitively-grounded theoretical and quantitative linguistics.

9 Conclusion

The CoNLL-SIGMORPHON shared task provided an evaluation on 52 languages, with large and small datasets, of systems for inflection and paradigm completion—two core tasks in computational morphological learning. On sub-task 1 (inflection), 24 systems were submitted, while on sub-task 2 (paradigm completion), 3 systems were submitted. All but one of the systems used rather similar neural network models, popularized by the SIGMORPHON shared task in 2016.

The results reinforce the conclusions of the 2016 shared task that encoder-decoder architectures perform strongly when training data is plentiful, with exact-match accuracy on held-out forms surpassing 90% on many languages; we note there was a shortage of non-neural systems this year to compare with. In addition, and contrary to common expectation, many participants showed that neural systems can do reasonably well even with small training datasets. A baseline sequence-to-sequence model achieves close to zero accuracy: e.g., Silfverberg et al. (2017) reported that all the team's neural models on the low data condition delivered accuracies in the 0-1% range without data augmentation, and other teams reported similar findings. However, with judicious application of biasing and data augmentation techniques, the best neural systems achieved over 50% exact-match prediction of inflected form strings on 100 examples, and 80% on 1,000 examples, as compared to 38% for a baseline system that learns simple inflectional rules. It is hard to say whether these are ``good'' results in an absolute sense. An interesting experiment would be to pit the small-data systems against human linguists who do not know the languages, to see whether the systems are able to identify the predictive patterns that humans discover (or miss).

An oracle ensembling of all systems shows that there is still much room for improvement, in particular in low-resource settings. We have released the training, development, and test sets, and expect these datasets to provide a useful benchmark for future research into learning of inflectional morphology and string-to-string transduction.

Acknowledgements

The first author would like to acknowledge the support of an NDSEG fellowship. Google provided support for the shared task in the form of an award. Several authors (CK, DY, JSG, MH) were supported in part by the Defense Advanced Research Projects Agency (DARPA) in the program Low Resource Languages for Emergent Incidents (LORELEI) under contract No. HR0011-15-C-0113. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).

Bibliography


1. Ackerman et al. (2009) Farrell Ackerman, James P. Blevins, and Robert Malouf. 2009. Parts and wholes: Patterns of relatedness in complex morphological systems and why they matter. In James P. Blevins and Juliette Blevins, editors, Analogy in grammar: Form and acquisition, pages 54–82. Oxford University Press, Oxford.
2. Ackerman and Malouf (2013) Farrell Ackerman and Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language, 89:429–464.
3. Aharoni and Goldberg (2017) Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada. Association for Computational Linguistics.
4. Aharoni et al. (2016) Roee Aharoni, Yoav Goldberg, and Yonatan Belinkov. 2016. Improving sequence to sequence learning for morphological inflection generation: The BIU-MIT systems for the SIGMORPHON 2016 shared task for morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 41–48, Berlin, Germany. Association for Computational Linguistics.
5. Ahlberg et al. (2014) Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 569–578, Gothenburg, Sweden. Association for Computational Linguistics.
6. Ahlberg et al. (2015) Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 1024–1029, Denver, CO. Association for Computational Linguistics.
7. Alegria et al. (2009) Iñaki Alegria, Izaskun Etxeberria, Mans Hulden, and Montserrat Maritxalar. 2009. Porting Basque morphological grammars to foma, an open-source tool. In International Workshop on Finite-State Methods and Natural Language Processing, pages 105–113. Springer.
8. Alegria and Etxeberria (2016) Iñaki Alegria and Izaskun Etxeberria. 2016. EHU at the SIGMORPHON 2016 shared task. A simple proposal: Grapheme-to-phoneme for inflection. In Proceedings of the 2016 Meeting of SIGMORPHON, Berlin, Germany. Association for Computational Linguistics.