Discontinuous Constituency Parsing with a Stack-Free Transition System and a Dynamic Oracle
Maximin Coavoux, Shay B. Cohen

TL;DR
This paper presents a new stack-free transition system for discontinuous constituency parsing using a set of parsing items, enabling efficient construction of trees and introducing a dynamic oracle, achieving state-of-the-art results.
Contribution
It introduces a novel transition system with constant-time access and a dynamic oracle for discontinuous constituency parsing, improving efficiency and accuracy.
Findings
Achieves state-of-the-art results on English and German treebanks.
Constructs any discontinuous tree in exactly 4n - 2 transitions.
Introduces the first dynamic oracle for this parsing task.
Abstract
We introduce a novel transition system for discontinuous constituency parsing. Instead of storing subtrees in a stack (i.e. a data structure with linear-time sequential access), the proposed system uses a set of parsing items, with constant-time random access. This change makes it possible to construct any discontinuous constituency tree in exactly $4n-2$ transitions for a sentence of length $n$. At each parsing step, the parser considers every item in the set to be combined with a focus item and to construct a new constituent in a bottom-up fashion. The parsing strategy is based on the assumption that most syntactic structures can be parsed incrementally and that the set (the memory of the parser) remains reasonably small on average. Moreover, we introduce a provably correct dynamic oracle for the new transition system, and present the first experiments in discontinuous…
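Table 1: The transition system, with each action presented as a deduction rule associated with preconditions. A configuration $(S, s_f, i, C)$ is followed by its step number $j$.

| | Input | Output | Precondition |
|---|---|---|---|
| Initial configuration | | $(\emptyset, \text{null}, 0, \emptyset) : 0$ | |
| Goal configuration | | $(\emptyset, \{0, \dots, n-1\}, n, C) : 4n-2$ | |
| Structural actions | | | |
| shift | $(S, s_f, i, C) : j$ | $(S \cup \{s_f\}, \{i\}, i+1, C) : j+1$ | $i < n$, $j$ is even |
| combine-$s$ | $(S, s_f, i, C) : j$ | $(S \setminus \{s\}, s_f \cup s, i, C) : j+1$ | $s \in S$, $j$ is even |
| Labelling actions | | | |
| label-X | $(S, s_f, i, C) : j$ | $(S, s_f, i, C \cup \{(X, s_f)\}) : j+1$ | $j$ is odd |
| no-label | $(S, s_f, i, C) : j$ | $(S, s_f, i, C) : j+1$ | $i \neq n$ or $S \neq \emptyset$, $j$ is odd |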
Table 2: Derivation of the tree from Figure 1. The subscript of an item in the set is its left-index.

| Even action | Set ($S$) | Focus ($s_f$) | Buffer | Odd action |
|---|---|---|---|---|
| | {} | none | So what ’s a parent to do ? | |
| sh | {} | {So} | what ’s a parent to do ? | no-label |
| sh | {{So}₀} | {what} | ’s a parent to do ? | label-WHNP |
| sh | {{So}₀, {what}₁} | {’s} | a parent to do ? | no-label |
| sh | {{So}₀, {what}₁, {’s}₂} | {a} | parent to do ? | no-label |
| sh | {{So}₀, {what}₁, {’s}₂, {a}₃} | {parent} | to do ? | no-label |
| comb-3 | {{So}₀, {what}₁, {’s}₂} | {a parent} | to do ? | label-NP |
| comb-2 | {{So}₀, {what}₁} | {’s a parent} | to do ? | no-label |
| sh | {{So}₀, {what}₁, {’s a parent}₂} | {to} | do ? | no-label |
| sh | {{So}₀, {what}₁, {’s a parent}₂, {to}₅} | {do} | ? | no-label |
| comb-1 | {{So}₀, {’s a parent}₂, {to}₅} | {what do} | ? | label-VP |
| comb-5 | {{So}₀, {’s a parent}₂} | {what to do} | ? | label-VP |
| comb-2 | {{So}₀} | {what ’s a parent to do} | ? | label-SQ |
| comb-0 | {} | {So what ’s a parent to do} | ? | no-label |
| sh | {{So what ’s a parent to do}₀} | {?} | | no-label |
| comb-0 | {} | {So what ’s a parent to do ?} | | label-SBARQ |
Table 3: Results on the development sets with the static and the dynamic oracle.

| Oracle | DPTB F1 | DPTB Disc. F1 | DPTB POS | Tiger F1 | Tiger Disc. F1 | Tiger POS | Negra F1 | Negra Disc. F1 | Negra POS |
|---|---|---|---|---|---|---|---|---|---|
| static | 91.1 | 68.2 | 97.2 | 87.4 | 61.7 | 98.3 | 83.6 | 51.3 | 97.9 |
| dynamic | 91.4 | 70.9 | 97.2 | 87.6 | 62.5 | 98.4 | 84.0 | 54.0 | 98.0 |
Table 4: Comparison with other discontinuous constituency parsers on the test sets.

| Model | English (DPTB) F | English (DPTB) Disc. F | German (Tiger) F | German (Tiger) Disc. F | German (Negra) F | German (Negra) Disc. F |
|---|---|---|---|---|---|---|
| Predicted POS tags or own tagging | | | | | | |
| This work, dynamic oracle | 90.9 | 67.3 | 82.5 | 55.9 | 83.2 | 56.3 |
| Coavoux et al. (2019),∗ gap, bi-LSTM | 91.0 | 71.3 | 82.7 | 55.9 | 83.2 | 54.6 |
| Stanojević and Garrido Alhama (2017),∗ swap, stack/tree-LSTM | | | 77.0 | | | |
| Coavoux and Crabbé (2017a), sr-gap, perceptron | | | 79.3 | | | |
| Versley (2016), pseudo-projective, chart-based | | | 79.5 | | | |
| Corro et al. (2017),∗ bi-LSTM, Maximum Spanning Arborescence | 89.2 | | | | | |
| van Cranenburgh et al. (2016), DOP, | 87.0 | | | | 74.8 | |
| Fernández-González and Martins (2015), dependency-based | | | 77.3 | | | |
| Gebhardt (2018), LCFRS with latent annotations | | | 75.1 | | | |
| Gold POS tags | | | | | | |
| Stanojević and Garrido Alhama (2017),∗ swap, stack/tree-LSTM | | | 81.6 | | 82.9 | |
| Coavoux and Crabbé (2017a), sr-gap, perceptron | | | 81.6 | 49.2 | 82.2 | 50.0 |
| Maier (2015), swap, perceptron | | | 74.7 | 18.8 | 77.0 | 19.8 |
| Corro et al. (2017),∗ bi-LSTM, Maximum Spanning Arborescence | 90.1 | | | | 81.6 | |
| Evang and Kallmeyer (2011), PLCFRS, | 79† | | | | | |
| Architecture hyperparameters | |
|---|---|
| Dimension of word embeddings | 32 |
| Dimension of character embeddings | 100 |
| Dimension of character bi-LSTM state | 50 for each direction |
| Dimension of sentence-level bi-LSTM | 200 for each direction |
| Dimension of hidden layers for the action scorer | 200 |
| Activation functions | for all hidden layers |
| Optimization hyperparameters | |
| Initial learning rate | |
| Learning rate decay | for step number |
| Dropout for tagger input | 0.5 |
| Dropout for parser input | 0.2 |
| Training epochs | 100 |
| Batch size | 1 sentence |
| Optimization algorithm | Averaged SGD (Polyak and Juditsky, 1992; Bottou, 2010) |
| Word and character embedding initialization | |
| Other parameters initialization (including LSTMs) | Xavier (Glorot and Bengio, 2010) |
| Gradient clipping (norm) | 100 |
| Dynamic oracle sampling probability $p$ | 0.15 |
Table 6: Empirical runtimes.

| Parser | Setting | Tiger tok/s | Tiger sent/s | DPTB tok/s | DPTB sent/s |
|---|---|---|---|---|---|
| This work | Python, neural, greedy, CPU | 978 | 64 | 910 | 38 |
| MTG (Coavoux et al., 2019) | C++, neural, greedy, CPU | 1934 | 126 | 1887 | 80 |
| MTG (Coavoux and Crabbé, 2017a) | C++, perceptron, beam=4, CPU | 4700 | 260 | | |
| rparse (Maier, 2015) | Java, perceptron, beam=8, CPU | | 80 | | |
| rparse (Maier, 2015) | Java, perceptron, beam=1, CPU | | 640 | | |
| Corro et al. (2017) | C++, neural, CPU | | | | |
| Corpus | Oracle | All const. F | All const. P | All const. R | Disc. const. F | Disc. const. P | Disc. const. R | POS Acc. |
|---|---|---|---|---|---|---|---|---|
| Development sets | | | | | | | | |
| English (DPTB) | static | 91.1 | 91.1 | 91.2 | 68.2 | 75.3 | 62.3 | 97.2 |
| English (DPTB) | dynamic | 91.4 | 91.5 | 91.3 | 70.9 | 76.1 | 66.4 | 97.2 |
| German (Tiger) | static | 87.4 | 87.8 | 87.0 | 61.7 | 64.4 | 59.2 | 98.3 |
| German (Tiger) | dynamic | 87.6 | 88.2 | 87.0 | 62.5 | 68.6 | 57.3 | 98.4 |
| German (Negra) | static | 83.6 | 83.8 | 83.4 | 51.3 | 53.3 | 49.5 | 97.9 |
| German (Negra) | dynamic | 84.0 | 84.7 | 83.4 | 54.0 | 58.1 | 50.5 | 98.0 |
| Test sets | | | | | | | | |
| English (DPTB) | dynamic | 90.9 | 91.3 | 90.6 | 67.3 | 73.3 | 62.1 | 97.6 |
| German (Tiger) | dynamic | 82.5 | 83.5 | 81.5 | 55.9 | 62.4 | 50.6 | 98.0 |
| German (Negra) | dynamic | 83.2 | 83.8 | 82.6 | 56.3 | 64.9 | 49.8 | 98.0 |
Maximin Coavoux, Naver Labs Europe (work done at the University of Edinburgh)
Shay B. Cohen, ILCC, School of Informatics, University of Edinburgh
Abstract
We introduce a novel transition system for discontinuous constituency parsing. Instead of storing subtrees in a stack (i.e. a data structure with linear-time sequential access), the proposed system uses a set of parsing items, with constant-time random access. This change makes it possible to construct any discontinuous constituency tree in exactly $4n-2$ transitions for a sentence of length $n$, whereas existing systems need a quadratic number of transitions to derive some structures. At each parsing step, the parser considers every item in the set to be combined with a focus item and to construct a new constituent in a bottom-up fashion. The parsing strategy is based on the assumption that most syntactic structures can be parsed incrementally and that the set (the memory of the parser) remains reasonably small on average. Moreover, we introduce a dynamic oracle for the new transition system, and present the first experiments in discontinuous constituency parsing using a dynamic oracle. Our parser obtains state-of-the-art results on three English and German discontinuous treebanks.
1 Introduction
Discontinuous constituency trees extend standard constituency trees by allowing crossing branches to represent long distance dependencies, such as the wh-extraction in Figure 1. Discontinuous constituency trees can be seen as derivations of Linear Context-Free Rewriting Systems (LCFRS; Vijay-Shanker et al., 1987), a class of formal grammars more expressive than context-free grammars, which makes them much harder to parse. In particular, exact CKY-style LCFRS parsing has an $\mathcal{O}(n^{3f})$ time complexity, where $f$ is the fan-out of the grammar (Kallmeyer, 2010).
A natural alternative to grammar-based chart parsing is transition-based parsing, which usually relies on fast approximate decoding methods such as greedy search or beam search. Transition-based discontinuous parsers construct discontinuous constituents by reordering terminals with the swap action (Versley, 2014a,b; Maier, 2015; Maier and Lichte, 2016; Stanojević and Garrido Alhama, 2017), or by using a split stack and the gap action to combine two non-adjacent constituents (Coavoux and Crabbé, 2017a; Coavoux et al., 2019). These proposals represent the memory of the parser (i.e. the tree fragments being constructed) with data structures with linear-time sequential access (either a stack, or a stack coupled with a double-ended queue). As a result, these systems need to perform at least $n$ actions to construct a new constituent from two subtrees separated by $n$ intervening subtrees. Our proposal aims at avoiding this cost when constructing discontinuous constituents.
We design a novel transition system in which a discontinuous constituent is constructed in a single step, without the use of reordering actions such as swap. The main innovation is that the memory of the parser is not represented by a stack, as is usual in shift-reduce systems, but by an unordered random-access set. The parser considers every constituent in the current memory to construct a new constituent in a bottom-up fashion, and thus instantly models interactions between parsing items that are not adjacent. As such, we describe a left-to-right parsing model that deviates from the standard stack-buffer setting, a legacy from pushdown automata and classical parsing algorithms for context-free grammars.
Our contributions are summarized as follows:
- We design a novel transition system for discontinuous constituency parsing, based on a memory represented by a set of items, and that derives any tree in exactly $4n-2$ steps for a sentence of length $n$;
- we introduce the first dynamic oracle for discontinuous constituency parsing;
- we present an empirical evaluation of the transition system and dynamic oracle on two German and one English discontinuous treebanks.
The code of our parser is released as an open-source project at https://gitlab.com/mcoavoux/discoparset.
2 Set-based Transition System
System overview
We propose to represent the memory of the parser by (i) a set of parsing items and (ii) a single focus item. Figure 2 (lower part) illustrates a configuration in our system. The parser constructs a tree with two main actions: shift the next token to make it the new focus item (shift), or combine any item in the set with the focus item to make a new constituent bottom-up (combine action).
Since the memory is not an ordered data structure, the parser considers every pending parsing item equally, and thus constructs a discontinuous constituent in a single step. This makes it able to construct any discontinuous tree in $4n-2$ transitions.
The use of an unordered random-access data structure to represent the memory of the parser also leads to a major change for the scoring system (Figure 2). Stack-based systems use a local view of a parsing configuration to extract features and score actions: features only rely on the few topmost elements on the stack and buffer. The score of each transition depends on the totality of this local view. In contrast, we consider every item in the set equally, and therefore rely on a global view of the memory (Section 3). However, we score each possible combination independently: the score of a single combination only depends on the two constituents that are combined, regardless of the rest of the set.
2.1 System Description
Definitions
We first define an instantiated (discontinuous) constituent $(X, s)$ as a nonterminal label $X$ associated with a set of token indexes $s$. We call $\min(s)$ the left-index of $(X, s)$ and $\max(s)$ its right-index. For example in Figure 1, the two VPs are respectively represented by (VP, {1, 6}) and (VP, {1, 5, 6}), and they have the same right-index (6) and left-index (1).
A parsing configuration is a quadruple $(S, s_f, i, C)$ where:
- $S$ is a set of sets of indexes and represents the memory of the parser;
- $s_f$ is a set of indexes called the focus item, and satisfies $\max(s_f) = i - 1$;
- $i$ is the index of the next token in the buffer;
- $C$ is a set of instantiated constituents.
Each new constituent is constructed bottom-up from the focus item $s_f$ and another item in the set $S$.
Transition set
Our proposed transition system is based on the following types of actions:
- shift constructs a singleton containing the next token in the buffer and assigns it as the new focus item. The former focus item is added to $S$.
- combine-$s$ computes the union between the focus item $s_f$ and another item $s$ from the set $S$, to form the new focus item $s_f \cup s$.
- label-X instantiates a new constituent $(X, s_f)$ whose yield is the set of indexes in the focus item $s_f$.
- no-label has no effect; its semantics is that the current focus set is not a constituent.
Following Cross and Huang (2016b), transitions are divided into structural actions (shift, combine-$s$) and labelling actions (label-X, no-label). The parser may only perform a structural action on an even step and a labelling action on an odd step. For our system, this distinction has the crucial advantage of keeping the number of possible actions low at each parsing step, compared to a system that would perform a combine action and a labelling action in a single reduce-$s$-X action. (In such a case, we would need to score on the order of $|S| \cdot |N|$ actions, where $N$ is the set of nonterminals, instead of $|S| + 1$ actions for our system.)
Table 1 presents each action as a deduction rule associated with preconditions. In Table 2, we describe how to derive the tree from Figure 1.
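The four actions can be sketched in a few lines of Python. This is a minimal illustration written for this section, not the released parser: the class name `Config`, the helper `replay` and the convention of naming a combine action by the left-index of the item it picks (as in Table 2) are assumptions made here. The sketch replays the derivation of Table 2 and checks the preconditions at every step.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional, Set, Tuple

Item = FrozenSet[int]

@dataclass
class Config:
    """Parsing configuration (S, s_f, i, C) plus the step counter j."""
    n: int                                    # sentence length
    S: Set[Item] = field(default_factory=set)
    focus: Optional[Item] = None              # s_f (None before the first shift)
    i: int = 0                                # index of the next buffer token
    C: Set[Tuple[str, Item]] = field(default_factory=set)
    j: int = 0                                # number of transitions so far

    def shift(self) -> None:
        assert self.j % 2 == 0 and self.i < self.n
        if self.focus is not None:
            self.S.add(self.focus)            # former focus goes to the set
        self.focus = frozenset({self.i})
        self.i += 1
        self.j += 1

    def combine(self, s: Item) -> None:
        assert self.j % 2 == 0 and s in self.S
        self.S.remove(s)
        self.focus = self.focus | s           # union, possibly discontinuous
        self.j += 1

    def label(self, X: str) -> None:
        assert self.j % 2 == 1
        self.C.add((X, self.focus))
        self.j += 1

    def no_label(self) -> None:
        assert self.j % 2 == 1 and (self.i != self.n or bool(self.S))
        self.j += 1

def replay(n: int, actions) -> Config:
    """Apply a sequence of action names such as 'sh', 'comb-3', 'label-NP'."""
    c = Config(n=n)
    for a in actions:
        if a == "sh":
            c.shift()
        elif a.startswith("comb-"):
            left = int(a[len("comb-"):])
            c.combine(next(s for s in c.S if min(s) == left))
        elif a.startswith("label-"):
            c.label(a[len("label-"):])
        else:
            c.no_label()
    return c

# Derivation of Table 2 for "So what 's a parent to do ?" (8 tokens).
ACTIONS = ("sh no-label sh label-WHNP sh no-label sh no-label sh no-label "
           "comb-3 label-NP comb-2 no-label sh no-label sh no-label "
           "comb-1 label-VP comb-5 label-VP comb-2 label-SQ "
           "comb-0 no-label sh no-label comb-0 label-SBARQ").split()
final = replay(8, ACTIONS)
print(final.j, sorted((X, sorted(s)) for X, s in final.C))
```

The discontinuous VP with yield {1, 6} is built by the single `comb-1` action, even though three other items sit in the memory at that point.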
2.2 Oracles
Training a transition-based parser requires an oracle, i.e. a function that determines what the best action is in a specific parsing configuration to serve as a training signal. We first describe a static oracle that provides a canonical derivation for a given gold tree. We then introduce a dynamic oracle that determines what the best action is in any parsing configuration.
2.2.1 Static Oracle
Our transition system exhibits a fair amount of spurious ambiguity, i.e. the existence of many possible derivations for a single tree. Indeed, since we use an unordered memory, an $n$-ary constituent (and more generally a tree) can be constructed by many different transition sequences. For example, the set {0, 1, 2} might be constructed by combining
- {0} and {1} first, and the result with {2}; or
- {1} and {2} first, and the result with {0}; or
- {0} and {2} first, and the result with {1}.
Following Cohen et al. (2012), we eliminate spurious ambiguity by selecting a canonical derivation for a gold tree. In particular, we design the static oracle (i) to apply combine as soon as possible in order to minimize the size of the memory (ii) to combine preferably with the most recent set in the memory when several combinations are possible. The first choice is motivated by properties of our system: when the memory is smaller, there are fewer choices, therefore decisions are simpler and less expensive to score.
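One possible reading of this canonical derivation is sketched below. It is an illustration under assumptions, not the authors' implementation: the gold tree is given as a dictionary from yields to labels, it has a root covering the whole sentence and no unary chains, and an item in the memory is taken to be combinable with the focus when both have the same smallest strictly containing gold constituent (the hypothetical helper `parent`). "Most recent" is interpreted as the highest right-index.

```python
def parent(s, gold):
    """Smallest gold yield that strictly contains the index set s."""
    return min((g for g in gold if s < g), key=len)

def static_oracle(n, gold_tree):
    """Return the canonical action sequence for a gold tree.

    gold_tree maps each constituent yield (a frozenset of token indexes)
    to its label; the root yield frozenset(range(n)) must be present.
    """
    gold = set(gold_tree)
    root = frozenset(range(n))
    S, actions = [], []
    focus, i = None, 0
    while focus != root:
        # Structural step: combine as soon as possible, otherwise shift.
        siblings = []
        if focus is not None:
            p = parent(focus, gold)
            siblings = [s for s in S if parent(s, gold) == p]
        if siblings:
            s = max(siblings, key=max)      # prefer the most recent item
            S.remove(s)
            focus = focus | s
            actions.append(f"comb-{min(s)}")
        else:
            assert i < n, "gold tree is not a well-formed tree"
            if focus is not None:
                S.append(focus)
            focus = frozenset({i})
            i += 1
            actions.append("sh")
        # Labelling step: label the focus iff its yield is a gold constituent.
        if focus in gold_tree:
            actions.append(f"label-{gold_tree[focus]}")
        else:
            actions.append("no-label")
    return actions

# Gold tree of Figure 1 ("So what 's a parent to do ?").
GOLD = {
    frozenset({1}): "WHNP",
    frozenset({3, 4}): "NP",
    frozenset({1, 6}): "VP",
    frozenset({1, 5, 6}): "VP",
    frozenset(range(1, 7)): "SQ",
    frozenset(range(8)): "SBARQ",
}
derivation = static_oracle(8, GOLD)
print(" ".join(derivation))
```

On this example the sketch reproduces the action sequence of Table 2, including the choice of `comb-1` before `comb-5` for the two VPs.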
2.2.2 Dynamic Oracle
Parsers are usually trained to predict the gold sequence of actions, using a static oracle. The limitation of this method is that the parser only sees a tiny portion of the search space at train time and only trains on gold input (i.e. configurations obtained after performing gold actions). At test time, it is in a different situation due to error propagation: it must predict what the best actions are in configurations from which the gold tree is probably no longer reachable.
To alleviate this limitation, Goldberg and Nivre (2012) proposed to train a parser with a dynamic oracle, an oracle that is defined for any parsing configuration and outputs the set of best actions to perform. In contrast, a static oracle is deterministic and is only defined for gold configurations.
Dynamic oracles were proposed for a wide range of dependency parsing transition systems (Goldberg and Nivre, 2013; Gómez-Rodríguez et al., 2014; Gómez-Rodríguez and Fernández-González, 2015), and later adapted to constituency parsing Coavoux and Crabbé (2016); Cross and Huang (2016b); Fernández-González and Gómez-Rodríguez (2018b, a).
In the remainder of this section, we introduce a dynamic oracle for our proposed transition system. It can be seen as an extension of the oracle of Cross and Huang (2016b) to the case of discontinuous parsing.
Preliminary definitions
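For a parsing configuration $c$, the relation $c \vdash c'$ holds iff $c'$ can be derived from $c$ by a single transition. We note $\vdash^*$ the reflexive and transitive closure of $\vdash$. An instantiated constituent $(X, s)$ is reachable from a configuration $c = (S, s_f, i, C)$ iff there exists $c' = (S', s'_f, i', C')$ such that $(X, s) \in C'$ and $c \vdash^* c'$. Similarly, a set of constituents $t$ (possibly a full discontinuous constituency tree) is reachable iff there exists a configuration $c' = (S', s'_f, i', C')$ such that $t \subseteq C'$ and $c \vdash^* c'$. We note $\mathrm{reach}(c, t^*)$ the set of constituents that are (i) in the gold set of constituents $t^*$ and (ii) reachable from $c$.
We define a total order $\preceq$ on index sets:
$$
s \preceq s' \iff
\begin{cases}
\max(s) < \max(s') & \text{if } \max(s) \neq \max(s'), \\
s \subseteq s' & \text{otherwise.}
\end{cases}
$$
This order naturally extends to the constituents of a tree: $(X, s) \preceq (X', s')$ iff $s \preceq s'$. If $(X, s)$ precedes $(X', s')$, then $(X, s)$ must be constructed before $(X', s')$. Indeed, since the right-index of the focus item is non-decreasing during a derivation (as per the transition definitions), constituents are constructed in the order of their right-index (first condition). Moreover, since the algorithm is bottom-up, a constituent must be constructed before its parent (second condition).
From a configuration $c = (S, s_f, i, C)$ at an odd step, a constituent $(X, s_g) \notin C$ is reachable iff both the following properties hold:
1. $s_f \preceq s_g$;
2. $\forall s \in S \cup \{s_f\}$, $(s \subseteq s_g)$ or $(s \cap s_g = \emptyset)$.
Condition 1 is necessary because the parser can only construct new constituents $(X, s)$ such that $s_f \preceq s$. Condition 2 makes sure that $s_g$ can be constructed from a union of elements from $S \cup \{s_f\}$, potentially augmented with terminals from the buffer: $\{i, i+1, \dots, \max(s_g)\}$.
Following Cross and Huang (2016b), we define $\mathrm{next}(c, t^*)$ as the smallest reachable gold constituent from a configuration $c$. Formally:
$$
\mathrm{next}(c, t^*) = \operatorname*{arg\,min}_{\preceq} \ \mathrm{reach}(c, t^*)
$$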
Oracle algorithm
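We first define the oracle for the odd step of a configuration $c = (S, s_f, i, C)$:
$$
o_{\text{odd}}(c, t^*) =
\begin{cases}
\{\text{label-X}\} & \text{if there exists } (X, s_f) \in t^*, \\
\{\text{no-label}\} & \text{otherwise.}
\end{cases}
$$
For even steps, assuming $\mathrm{next}(c, t^*) = (X, s_g)$, we define the oracle as follows:
$$
o_{\text{even}}(c, t^*) =
\begin{cases}
\{\text{combine-}s \mid (s_f \cup s) \subseteq s_g\} & \text{if } \max(s_g) = \max(s_f), \\
\{\text{combine-}s \mid (s_f \cup s) \subseteq s_g\} \cup \{\text{shift}\} & \text{if } \max(s_f) < \max(s_g).
\end{cases}
$$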
We provide a proof of the correctness of the oracle in Appendix A.
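The even-step oracle can be sketched as follows, under explicitly stated assumptions about the definitions: the order on index sets compares right-indexes first and falls back to set inclusion when they are equal; a gold yield is reachable when the focus precedes it and every item in the memory is either included in it or disjoint from it; and, at an even step, the yield equal to the focus itself is excluded because its labelling step is over. Function names and the tuple encoding of actions are illustrative choices, not the released code.

```python
def precedes(s, t):
    """Assumed order on index sets: right-index first, then inclusion."""
    if max(s) != max(t):
        return max(s) < max(t)
    return s <= t

def reach_even(S, focus, gold_yields):
    """Gold yields still constructible from an even-step configuration."""
    items = list(S) + [focus]
    return [g for g in gold_yields
            if g != focus                                   # labelling of focus is over
            and precedes(focus, g)                          # condition 1
            and all(s <= g or not (s & g) for s in items)]  # condition 2

def oracle_even(S, focus, gold_yields):
    """Set of best structural actions, as ('shift',) or ('combine', item)."""
    candidates = reach_even(S, focus, gold_yields)
    if not candidates:
        return set()        # no gold constituent left to aim for
    # Smallest reachable gold yield; reachable yields with the same
    # right-index are nested in a tree, so their size orders them.
    s_g = min(candidates, key=lambda g: (max(g), len(g)))
    best = {("combine", s) for s in S if (focus | s) <= s_g}
    if max(focus) < max(s_g):
        best.add(("shift",))
    return best

GOLD_YIELDS = [frozenset({1}), frozenset({3, 4}), frozenset({1, 6}),
               frozenset({1, 5, 6}), frozenset(range(1, 7)), frozenset(range(8))]
f = frozenset

# Gold configuration from Table 2: focus {do}, the oracle asks for comb-1.
gold_conf = oracle_even({f({0}), f({1}), f({2, 3, 4}), f({5})}, f({6}), GOLD_YIELDS)
# Non-gold configuration: {to} and {do} were wrongly combined, so (VP, {1, 6})
# is lost, but (VP, {1, 5, 6}) can still be recovered by combining with {what}.
error_conf = oracle_even({f({0}), f({1}), f({2, 3, 4})}, f({5, 6}), GOLD_YIELDS)
# Early configuration: focus {'s}, nothing to combine yet.
early_conf = oracle_even({f({0}), f({1})}, f({2}), GOLD_YIELDS)
print(gold_conf, error_conf, early_conf)
```

The second call shows the point of a dynamic oracle: from a configuration that a static oracle never visits, it still returns the action that keeps the remaining gold constituents reachable.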
3 A Neural Network based on Constituent Boundaries
We first present an encoder that computes context-aware representations of tokens (Section 3.1). We then discuss how to compute the representation of a set of tokens (Section 3.2). We describe the action scorer (Section 3.3), the POS tagging component (Section 3.4), and the objective function (Section 3.5).
3.1 Token Representations
As in recent proposals in dependency and constituency parsing Cross and Huang (2016a); Kiperwasser and Goldberg (2016), our scoring system is based on a sentence transducer that constructs a context-aware representation for each token.
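Given a sequence of tokens $(x_0, \dots, x_{n-1})$, we first run a single-layer character bi-LSTM encoder $c$ to obtain a character-aware embedding $c(x_k)$ for each token. We represent a token $x_k$ as the concatenation of a standard word embedding $e(x_k)$ and the character-aware embedding:
$$
w_k = [c(x_k); e(x_k)]
$$
Then, we run a 2-layer bi-LSTM transducer over the sequence of token representations:
$$
(h^{(1)}_0, \dots, h^{(1)}_{n-1}) = \text{bi-LSTM}(w_0, \dots, w_{n-1}), \qquad
(h^{(2)}_0, \dots, h^{(2)}_{n-1}) = \text{bi-LSTM}(h^{(1)}_0, \dots, h^{(1)}_{n-1})
$$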
The parser uses the context-aware token representations to construct vector representations of sets or constituents.
3.2 Set Representations
An open issue in neural discontinuous parsing is the representation of discontinuous constituents. In projective constituency parsing, it has become standard to use the boundaries of constituents Hall et al. (2014); Crabbé (2015); Durrett and Klein (2015), an approach that proved very successful with bi-LSTM token representations Cross and Huang (2016b); Stern et al. (2017).
Although constituent boundary features improve discontinuous parsing (Coavoux and Crabbé, 2017a), relying only on the left-index and the right-index of a constituent has the limitation of ignoring gaps inside a constituent. For example, since the two VPs in Figure 1 have the same right-index and left-index, they would have the same representations. It may also happen that constituents with identical right-index and left-index do not have the same labels.
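We represent a (possibly partial) constituent with the yield $s$ by computing 4 indexes from $s$: $(\min(s), \max(s), \min(\bar{s}), \max(\bar{s}))$. The set $\bar{s}$ represents the gap in $s$, i.e. the tokens between $\min(s)$ and $\max(s)$ that are not in the yield of $s$:
$$
\bar{s} = \{k \mid \min(s) < k < \max(s) \text{ and } k \notin s\}
$$
Finally, we extract the corresponding token representations of the 4 indexes and concatenate them to form the vector representation $r(s)$ of $s$, where $h_k$ is the context-aware representation of token $k$ (Section 3.1):
$$
r(s) = [h_{\min(s)}; h_{\max(s)}; h_{\min(\bar{s})}; h_{\max(\bar{s})}]
$$
For an index set that does not contain a gap, we have $\bar{s} = \emptyset$. To handle this case, we use a parameter vector $h_{\text{nil}}$, randomly initialized and learned jointly with the network, to embed $\min(\emptyset) = \max(\emptyset) = \text{nil}$.
For example, the constituents (VP, {1, 6}) and (VP, {1, 5, 6}) will be respectively vectorized as:
$$
r(\{1, 6\}) = [h_1; h_6; h_2; h_5], \qquad r(\{1, 5, 6\}) = [h_1; h_6; h_2; h_4]
$$
This representation method makes sure that two distinct index sets have distinct representations, as long as they have at most one gap each. This property no longer holds if one index set has more than one gap.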
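The four boundary indexes are easy to compute directly from an index set. The sketch below is illustrative only (the function name `boundary_indexes` is an assumption, and `None` stands in for the learned nil embedding); it reproduces the two VP examples and shows a collision between a set with two gaps and a set with one gap.

```python
def boundary_indexes(s):
    """Return (left-index, right-index, first gap index, last gap index).

    The last two values are None when the index set has no gap.
    """
    lo, hi = min(s), max(s)
    gap = [k for k in range(lo + 1, hi) if k not in s]
    if not gap:
        return (lo, hi, None, None)
    return (lo, hi, min(gap), max(gap))

vp_small = boundary_indexes({1, 6})        # gap is {2, 3, 4, 5}
vp_large = boundary_indexes({1, 5, 6})     # gap is {2, 3, 4}
np_plain = boundary_indexes({3, 4})        # continuous, no gap

# With more than one gap, distinct sets can share the same four indexes:
two_gaps = boundary_indexes({0, 2, 4})     # gaps {1} and {3}
one_gap = boundary_indexes({0, 4})         # gap {1, 2, 3}
print(vp_small, vp_large, np_plain, two_gaps, one_gap)
```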
3.3 Action Scorer
For each type of action (structural or labelling), we use a feedforward network with two hidden layers.
Structural actions
At structural steps, for a configuration $c = (S, s_f, i, C)$, we need to compute the score of $|S|$ combine actions and possibly a shift action. In our approach, the score of a combine-$s$ action only depends on $s$ and $s_f$ and is independent of the rest of the configuration (i.e. other items in the set). Writing $r(s)$ for the vector representation of an index set $s$ (Section 3.2) and $S = \{s_1, \dots, s_{|S|}\}$, we first construct the input matrix $M$ as follows:
$$
M = \begin{pmatrix}
r(s_1) & \cdots & r(s_{|S|}) & r(\{i\}) \\
r(s_f) & \cdots & r(s_f) & r(s_f)
\end{pmatrix}
$$
Each of the first $|S|$ columns of matrix $M$ represents the input for a combine action, whereas the last column is the input for the shift action. We then compute the score of each structural action:
$$
P(\cdot \mid c) = \mathrm{Softmax}(\mathrm{FF}_s(M))
$$
where $\mathrm{FF}_s$ is a feedforward network with two hidden layers, an activation function after each hidden layer, and a single output unit. In other words, it outputs a single scalar for each column vector of matrix $M$. This part of the network can be seen as an attention mechanism, where the focus item is the query, and the context is formed by the items in the set and the first element in the buffer.
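The column-wise scoring can be illustrated with a toy, dependency-free sketch. Everything concrete here is an assumption made for illustration: the vectors are random stand-ins for the set representations, the hidden layers use tanh and no bias terms, and the dimensions are arbitrary. What the sketch does show is the structure described above: one shared network scores each (item, focus) column independently, and a softmax normalizes over the combine and shift candidates.

```python
import math
import random

def feed_forward(x, W1, W2, w_out):
    """Two hidden layers (tanh assumed here) and a single output unit."""
    h1 = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W1]
    h2 = [math.tanh(sum(w * v for w, v in zip(row, h1))) for row in W2]
    return sum(w * v for w, v in zip(w_out, h2))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def score_structural(item_vecs, focus_vec, shift_vec, params):
    """One column per combine candidate, plus a last column for shift."""
    columns = [vec + focus_vec for vec in item_vecs]   # list concatenation
    columns.append(shift_vec + focus_vec)
    scores = [feed_forward(col, *params) for col in columns]
    return scores, softmax(scores)

rng = random.Random(0)

def rand_vec(d):
    return [rng.uniform(-1.0, 1.0) for _ in range(d)]

D, H = 4, 5                       # toy representation and hidden sizes
params = ([rand_vec(2 * D) for _ in range(H)],
          [rand_vec(H) for _ in range(H)],
          rand_vec(H))
items = [rand_vec(D) for _ in range(3)]   # stand-ins for r(s), s in S
focus, nxt = rand_vec(D), rand_vec(D)     # stand-ins for r(s_f) and r({i})

scores, probs = score_structural(items, focus, nxt, params)
# Same focus, smaller memory: the raw score of the first item is unchanged.
scores_small, probs_small = score_structural(items[:1], focus, nxt, params)
print(probs)
```

Only the normalized probabilities depend on the whole memory; the raw score of a given combination does not.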
Labelling actions
We compute the probabilities of labelling actions as follows:
$$
P(\cdot \mid s_f) = \mathrm{Softmax}(\mathrm{FF}_l(r(s_f)))
$$
where $\mathrm{FF}_l$ is a feedforward network with two hidden layers, each followed by an activation function, and $|N| + 1$ output units, where $N$ is the set of nonterminals.
3.4 POS Tagger
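Following Coavoux and Crabbé (2017b), we use the first layer of the bi-LSTM transducer as input to a Part-of-Speech (POS) tagger that is learned jointly with the parser. For a sentence $(x_0, \dots, x_{n-1})$, we compute the probability of a sequence of POS tags $(t_0, \dots, t_{n-1})$ as follows:
$$
P(t_0, \dots, t_{n-1} \mid x_0, \dots, x_{n-1}) = \prod_{k=0}^{n-1} \mathrm{Softmax}(W^{(t)} h^{(1)}_k + b^{(t)})_{t_k}
$$
where $W^{(t)}$ and $b^{(t)}$ are parameters.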
3.5 Objective Function
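In the static oracle setting, for a single sentence $x = (x_0, \dots, x_{n-1})$, we optimize the sum of the log-likelihood of gold POS-tags $(t_0, \dots, t_{n-1})$ and the log-likelihood of gold parsing actions $(a_1, \dots, a_{4n-2})$:
$$
\mathcal{L} = \mathcal{L}_t + \mathcal{L}_p, \qquad
\mathcal{L}_t = \sum_{k=0}^{n-1} \log P(t_k \mid x), \qquad
\mathcal{L}_p = \sum_{j=1}^{4n-2} \log P(a_j \mid a_1, \dots, a_{j-1}, x)
$$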
We optimize this objective by alternating a stochastic step for the tagging objective and a stochastic step for the parsing objective, as is standard in multitask learning Caruana (1997).
In the dynamic oracle setting, instead of optimizing the likelihood of the gold actions (assuming all previous actions were gold), we optimize the likelihood of the best actions, as computed by the dynamic oracle, from a configuration sampled from the space of all possible configurations. In practice, before each epoch, we sample each sentence from the training corpus with probability $p$ and we use the current (non-averaged) parameters to parse the sentence and generate a sequence of configurations. Instead of selecting the highest-scoring action at each parsing step, as in a normal inference step, we sample an action using the softmax distribution computed by the parser, as done by Ballesteros et al. (2016). Then, we use the dynamic oracle to calculate the best action from each of these configurations. In case there are several best actions, we deterministically choose a single action by favoring a combine over a shift (to bias the model towards a small memory), and by combining with the item with the highest right-index (to avoid spurious discontinuity in partial constituents). We train the parser on these sequences of potentially non-gold configuration-action pairs.
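The two choices described in this paragraph, sampling an action for exploration and breaking ties among the oracle's best actions, can be sketched as follows. The function names and the tuple encoding of actions, `("shift",)` and `("combine", item)`, are assumptions made for this illustration.

```python
import random

def sample_action(actions, probs, rng):
    """Explore: draw an action from the parser's softmax distribution
    instead of taking the highest-scoring one."""
    return rng.choices(actions, weights=probs, k=1)[0]

def pick_training_target(best_actions):
    """Break ties among the oracle's best actions deterministically:
    prefer a combine over a shift, and among combines prefer the item
    with the highest right-index."""
    assert best_actions, "the oracle returned no action"
    combines = [a for a in best_actions if a[0] == "combine"]
    if combines:
        return max(combines, key=lambda a: max(a[1]))
    return ("shift",)

rng = random.Random(0)
candidates = [("shift",), ("combine", frozenset({1})), ("combine", frozenset({5}))]

explored = sample_action(candidates, [0.2, 0.5, 0.3], rng)
forced = sample_action(candidates, [0.0, 1.0, 0.0], rng)   # all mass on one action
target = pick_training_target(set(candidates))
only_shift = pick_training_target({("shift",)})
print(explored, forced, target, only_shift)
```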
4 Experiments
We carried out experiments to assess the adequacy of our system and the effect of training with the dynamic oracle. We present the three discontinuous constituency treebanks that we used (Section 4.1), our experimental protocol (Section 4.2), then we discuss the results (Section 4.3) and the efficiency of the parser (Section 4.4).
4.1 Datasets
We perform experiments on three discontinuous constituency corpora. The discontinuous Penn Treebank was introduced by Evang and Kallmeyer (2011) who converted the long distance dependencies encoded by indexed traces in the original Penn treebank Marcus et al. (1993) to discontinuous constituents. We used the standard split (sections 2-21 for training, 22 for development and 23 for test). The Tiger corpus Brants et al. (2004) and the Negra corpus Skut et al. (1997) are both German treebanks natively annotated with discontinuous constituents. We used the SPMRL split for the Tiger corpus Seddah et al. (2013), and the split of Dubey and Keller (2003) for the Negra corpus.
4.2 Implementation and Protocol
We implemented our parser in Python using the Pytorch library Paszke et al. (2017). We trained each model with the ASGD algorithm Polyak and Juditsky (1992) for 100 epochs. Training a single model takes approximately a week with a GPU. We evaluate a model every 4 epochs on the validation set and select the best performing model according to the validation F-score. We refer the reader to Table 5 of Appendix B for the full list of hyperparameters.
We evaluate models with the dedicated module of discodop (https://github.com/andreasvc/disco-dop; van Cranenburgh et al., 2016). We use the standard evaluation parameters (proper.prm), which ignore punctuation and root symbols. We report two evaluation metrics: a standard F-score (F) and an F-score computed only on discontinuous constituents (Disc. F), which provides a more qualitative evaluation of the ability of the parser to recover long distance dependencies.
4.3 Results
Effect of Dynamic Oracle
We present parsing results on the development sets of each corpus in Table 3. The effect of the oracle is in line with other published results in projective constituency parsing Coavoux and Crabbé (2016); Cross and Huang (2016b) and dependency parsing Goldberg and Nivre (2012); Gómez-Rodríguez et al. (2014): the dynamic oracle improves the generalization capability of the parser.
External comparisons
In Table 4, we compare our parser to other transition-based parsers (Maier, 2015; Coavoux and Crabbé, 2017a; Stanojević and Garrido Alhama, 2017; Coavoux et al., 2019), the pseudo-projective parser of Versley (2016), grammar-based chart parsers (Evang and Kallmeyer, 2011; van Cranenburgh et al., 2016; Gebhardt, 2018) and parsers based on dependency parsing (Fernández-González and Martins, 2015; Corro et al., 2017). Note that some of them only report results in a gold POS tag setting (the parser has access to gold POS tags and uses them as features), a setting that is much easier than ours.
Our parser matches the state of the art of Coavoux et al. (2019). This promising result shows that it is feasible to design accurate transition systems without an ordered memory.
4.4 Efficiency
Our transition system derives a tree for a sentence of $n$ words in exactly $4n-2$ transitions. Indeed, there must be $n$ shift actions, and $n-1$ combine actions. Each of these $2n-1$ transitions must be followed by a single labelling action.
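The count can be written out as a short worked equation. The only step not spelled out in the text is why there are $n-1$ combine actions: each combine merges two items into one, so going from $n$ singletons to a single item covering the sentence takes $n-1$ merges. The 8-token sentence of Table 2 serves as a check.

```latex
\begin{align*}
\text{structural actions} &= \underbrace{n}_{\text{shift}} + \underbrace{n-1}_{\text{combine}} = 2n-1 \\
\text{labelling actions}  &= 2n-1 \quad \text{(one after each structural action)} \\
\text{total}              &= 2\,(2n-1) = 4n-2 \\
n = 8:\quad               & 8 + 7 = 15 \text{ structural}, \quad 15 \text{ labelling}, \quad 4 \cdot 8 - 2 = 30 \text{ transitions}
\end{align*}
```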
The statistical model responsible for choosing which action to perform at each parsing step needs to score $|S| + 1$ actions for a structural step and $|N| + 1$ actions for a labelling step (where $N$ is the set of possible nonterminals). Since in the worst case, $S$ contains $n-1$ singletons, the parser has an $\mathcal{O}(n(|N| + n))$ time complexity.
In practice, the memory of the parser remains relatively small on average. We report in Figure 3 the distribution of the size of $S$ across configurations when parsing the development sets of the three corpora. For the German treebanks, the memory contains 7 or fewer elements for more than 99 percent of configurations. For the Penn treebank, the memory is slightly larger, with 98 percent of configurations containing 11 or fewer items.
We report empirical runtimes in Table 6 of Appendix C. Our parser compares decently with other transition-based parsers, despite being written in Python.
5 Related Work
Existing transition systems for discontinuous constituency parsing rely on three main strategies for constructing discontinuous constituents: a swap-based strategy, a split-stack strategy, and the use of non-local transitions.
Swap-based systems
Swap-based transition systems are based on the idea that any discontinuous constituency tree can be transformed into a projective tree by reordering terminals. They reorder terminals by swapping them with a dedicated action (swap), commonly used in dependency parsing Nivre (2009). The first proposals in transition-based discontinuous constituency parsing used the swap action on top of an easy-first parser Versley (2014a, b). Subsequent proposals relied on a shift-reduce system Maier (2015); Maier and Lichte (2016) or a shift-promote-adjoin system Stanojević and Garrido Alhama (2017).
The main limitation of swap-based systems is that they tend to require a large number of transitions to derive certain trees. The choice of an oracle that minimizes derivation lengths has a substantially positive effect on parsing (Maier and Lichte, 2016; Stanojević and Garrido Alhama, 2017).
Split-stack systems
The second parsing strategy constructs discontinuous constituents by allowing the parser to reduce pairs of items that are not adjacent in the stack. In practice, Coavoux and Crabbé (2017a) split the usual stack of shift-reduce parsers into two data structures (a stack and a double-ended queue), in order to give the parser access to two focus items: the respective tops of the stack and the deque, which may or may not be adjacent. A dedicated action, gap, pushes the top of the stack onto the bottom of the queue to make the next item in the stack available for a reduction.
The split stack associated with the gap action can be interpreted as a linear-access memory: it is possible to access the $i^{\text{th}}$ element in the stack, but it requires $\mathcal{O}(i)$ operations.
Non-local transitions
Non-local transitions generalize standard parsing actions to non-adjacent elements in the parsing configurations. Maier and Lichte (2016) introduced a non-local transition SkipShift-$i$ which applies shift to the $i^{\text{th}}$ element in the buffer. Non-local transitions are also widely used in non-projective dependency parsing (Attardi, 2006; Qi and Manning, 2017; Fernández-González and Gómez-Rodríguez, 2018).
The key difference between these systems and ours is that we use an unordered memory. As a result, the semantics of the combine-$s$ action we introduce in Section 2 is independent from a specific position in the stack or the buffer. A system with an action such as SkipShift-$i$ needs to learn parameters with every possible $i$, and will only learn parameters with the SkipShift-$i$ actions that are required to derive the training set. In contrast, we use the same parameters to score each possible combine-$s$ action.
6 Conclusion
We have presented a novel transition system that dispenses with the use of a stack, i.e. a memory with linear sequential access. Instead, the memory of the parser is represented by an unordered data structure with random access: a set. We have designed a dynamic oracle for the resulting system and shown the empirical potential of both with state-of-the-art results on discontinuous constituency parsing of one English and two German treebanks. Finally, we plan to adapt our system to non-projective dependency parsing and semantic graph parsing.
Acknowledgments
We thank Caio Corro, Giorgio Satta, Marco Damonte, as well as NAACL anonymous reviewers for feedback and suggestions. We gratefully acknowledge the support of Huawei Technologies.
Appendix A Oracle Correctness
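The oracle leads to the reachable tree with the highest F-score with respect to the gold tree. The F-score of a predicted tree $\hat{T}$ (represented as a set of instantiated constituents) with respect to a gold tree $T^*$ is defined as:
$$\mathrm{precision}(\hat{T}, T^*) = \frac{|\hat{T} \cap T^*|}{|\hat{T}|}, \qquad \mathrm{recall}(\hat{T}, T^*) = \frac{|\hat{T} \cap T^*|}{|T^*|},$$
$$F_1(\hat{T}, T^*) = \frac{2 \cdot \mathrm{precision}(\hat{T}, T^*) \cdot \mathrm{recall}(\hat{T}, T^*)}{\mathrm{precision}(\hat{T}, T^*) + \mathrm{recall}(\hat{T}, T^*)}.$$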
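As a concrete illustration of scoring a predicted tree against a gold tree when both are represented as sets of constituents, here is a minimal sketch using the standard precision, recall and F1 definitions. The encoding of a constituent as a `(label, frozenset of token indices)` pair and the example trees are assumptions made for this illustration.

```python
def f_score(predicted, gold):
    """Precision, recall and F1 between two sets of constituents."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical constituents: (label, frozenset of token indices).
gold = {
    ("S", frozenset({0, 1, 2, 3})),
    ("VP", frozenset({0, 2, 3})),  # discontinuous: skips token 1
    ("NP", frozenset({2, 3})),
}
predicted = {
    ("S", frozenset({0, 1, 2, 3})),
    ("VP", frozenset({2, 3})),  # wrong span, so it does not match gold
    ("NP", frozenset({2, 3})),
}
p, r, f = f_score(predicted, gold)
print(p, r, f)
```

Using index sets rather than (start, end) pairs lets the same comparison handle discontinuous constituents without special cases.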
By definition, the oracle for labelling actions is optimal for precision because it constructs a constituent only if it is gold, and optimal for recall because it will construct a gold constituent if it is possible to do so.
Moreover, is optimal for recall because any gold constituent reachable from will still be reachable after any transition in . Assuming a configuration and , we consider separately the shift case and the combine- case:
- shift case (): constituents reachable from and not reachable from shift() satisfy . If a gold constituent satisfies this property, we have , which contradicts the assumption that (see definition of oracle in Section 2.2.2).
- combine- case: Let be a reachable gold constituent. Since it is compatible with , there are three possible cases:
  - if is an ascendant of , then , therefore is still reachable from combine-().
  - if is a descendant of then , which contradicts the definition of .
  - if and are completely disjoint, we have , therefore , and is still reachable from combine-().
Finally, since the oracle for structural actions does not construct new constituents (that is the role of labelling actions), it is optimal for precision.
Appendix B Hyperparameters
The list of hyperparameters is presented in Table 5.
- We use learning rate warm-up (linear increase from 0 to during the first 1000 steps).
- Before the update, we add Gaussian noise to the gradient of every parameter with mean 0 and variance Neelakantan et al. (2015).
- All experiments use greedy search decoding (we did not experiment with beam search).
- Before each training step, we replace a word embedding by an ‘UNK’ pseudo-word embedding with probability . We only do this replacement for the least frequent word-types ( least frequent word-types). The ‘UNK’ embedding is then used to represent unknown words.
- We apply dropout at the input of the tagger and the input of action scorers: each single prediction has its own dropout mask.
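Three of the training heuristics listed above (learning rate warm-up, gradient noise, and replacement of rare words by ‘UNK’) can be sketched in a few lines. This is a minimal illustration with plain Python lists instead of tensors; the function names are hypothetical and every numeric value in the usage lines is a placeholder, not one of the paper's settings.

```python
import math
import random


def warmup_lr(step, base_lr, warmup_steps=1000):
    """Linear increase from 0 to base_lr over the first warmup_steps updates."""
    return base_lr * min(1.0, step / warmup_steps)


def add_gradient_noise(grad, variance, rng):
    """Add zero-mean Gaussian noise with the given variance to each entry."""
    std = math.sqrt(variance)
    return [g + rng.gauss(0.0, std) for g in grad]


def replace_rare_words(words, rare_words, p, rng, unk="UNK"):
    """Replace each rare word by the UNK pseudo-word with probability p."""
    return [unk if w in rare_words and rng.random() < p else w for w in words]


# All numeric values below are placeholders, not the paper's settings.
rng = random.Random(0)
print(warmup_lr(500, base_lr=0.01))
print(add_gradient_noise([0.1, -0.2], 0.01, rng))
print(replace_rare_words(["the", "aardvark"], {"aardvark"}, 0.5, rng))
```

Restricting the replacement to rare word types means the ‘UNK’ embedding is trained on contexts that resemble those of genuinely unknown words at test time.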
Appendix C Detailed Results
References
- Giuseppe Attardi. 2006. Experiments with a multilanguage non-projective dependency parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 166–170. Association for Computational Linguistics.
- Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack LSTM parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2005–2010, Austin, Texas. Association for Computational Linguistics.
- Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), pages 177–187, Paris, France. Springer.
- Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. Tiger: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620.
- Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
- Maximin Coavoux and Benoit Crabbé. 2016. Neural greedy constituent parsing with dynamic oracles. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 172–182, Berlin, Germany. Association for Computational Linguistics.
- Maximin Coavoux and Benoit Crabbé. 2017a. Incremental discontinuous phrase structure parsing with the gap transition. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1259–1270, Valencia, Spain. Association for Computational Linguistics.
- Maximin Coavoux and Benoit Crabbé. 2017b. Multilingual lexicalized constituency parsing with word-level auxiliary tasks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 331–336, Valencia, Spain. Association for Computational Linguistics.
