Discontinuous Constituency Parsing with a Stack-Free Transition System and a Dynamic Oracle

Maximin Coavoux; Shay B. Cohen

arXiv:1904.00615·cs.CL·April 2, 2019


TL;DR

This paper presents a new stack-free transition system for discontinuous constituency parsing using a set of parsing items, enabling efficient construction of trees and introducing a dynamic oracle, achieving state-of-the-art results.

Contribution

It introduces a novel transition system with constant-time access and a dynamic oracle for discontinuous constituency parsing, improving efficiency and accuracy.

Findings

01

Achieves state-of-the-art results on English and German treebanks.

02

Constructs any discontinuous tree in exactly 4n - 2 transitions.

03

Introduces the first dynamic oracle for this parsing task.

Abstract

We introduce a novel transition system for discontinuous constituency parsing. Instead of storing subtrees in a stack --i.e. a data structure with linear-time sequential access-- the proposed system uses a set of parsing items, with constant-time random access. This change makes it possible to construct any discontinuous constituency tree in exactly $4 n - 2$ transitions for a sentence of length $n$ . At each parsing step, the parser considers every item in the set to be combined with a focus item and to construct a new constituent in a bottom-up fashion. The parsing strategy is based on the assumption that most syntactic structures can be parsed incrementally and that the set --the memory of the parser-- remains reasonably small on average. Moreover, we introduce a provably correct dynamic oracle for the new transition system, and present the first experiments in discontinuous…

Tables (7)

Table 1: Set-based transition system description. Variable $j$ is the number of steps performed since the start of the derivation.

Initial configuration	$(\emptyset, null, 0, \emptyset) : 0$
Goal configuration	$(\emptyset, \{0, 1, \dots, n - 1\}, n, C) : 4 n - 2$
Structural actions	Input	Output	Precondition
shift	$(S, s_{f}, i, C) : j$	$\Rightarrow (S \cup \{s_{f}\}, \{i\}, i + 1, C) : j + 1$	$i < n$, $j$ is even
combine-$s$	$(S, s_{f}, i, C) : j$	$\Rightarrow (S - s, s_{f} \cup s, i, C) : j + 1$	$s \in S$, $j$ is even
Labelling actions
label-X	$(S, s_{f}, i, C) : j$	$\Rightarrow (S, s_{f}, i, C \cup \{(X, s_{f})\}) : j + 1$	$j$ is odd
no-label	$(S, s_{f}, i, C) : j$	$\Rightarrow (S, s_{f}, i, C) : j + 1$	$i \neq n$ or $S \neq \emptyset$, $j$ is odd

Table 2: Full derivation for the sentence in Figure 1. As a convention, we index elements in the set with their left-index and use comb-$i$ to denote comb-$s_{i}$. We also use tokens instead of their indexes for better legibility.

Even action	Set ( $S$ )	Focus ( $s_{f}$ )	Buffer	Odd action
	{}	none	So what ’s a parent to do ?
$\Rightarrow$ sh $\Rightarrow$	{}	{So}	what ’s a parent to do ?	$\Rightarrow$ no-label
$\Rightarrow$ sh $\Rightarrow$	{{So}₀}	{what}	’s a parent to do ?	$\Rightarrow$ label-WHNP
$\Rightarrow$ sh $\Rightarrow$	{{So}₀, {what}₁}	{’s}	a parent to do ?	$\Rightarrow$ no-label
$\Rightarrow$ sh $\Rightarrow$	{{So}₀, {what}₁, {’s}₂}	{a}	parent to do ?	$\Rightarrow$ no-label
$\Rightarrow$ sh $\Rightarrow$	{{So}₀, {what}₁, {’s}₂, {a}₃}	{parent}	to do ?	$\Rightarrow$ no-label
$\Rightarrow$ comb-3 $\Rightarrow$	{{So}₀, {what}₁, {’s}₂}	{a parent}	to do ?	$\Rightarrow$ label-NP
$\Rightarrow$ comb-2 $\Rightarrow$	{{So}₀, {what}₁}	{’s a parent}	to do ?	$\Rightarrow$ no-label
$\Rightarrow$ sh $\Rightarrow$	{{So}₀, {what}₁, {’s a parent}₂}	{to}	do ?	$\Rightarrow$ no-label
$\Rightarrow$ sh $\Rightarrow$	{{So}₀, {what}₁, {’s a parent}₂, {to}₅}	{do}	?	$\Rightarrow$ no-label
$\Rightarrow$ comb-1 $\Rightarrow$	{{So}₀, {’s a parent}₂, {to}₅}	{what do}	?	$\Rightarrow$ label-VP
$\Rightarrow$ comb-5 $\Rightarrow$	{{So}₀, {’s a parent}₂}	{what to do}	?	$\Rightarrow$ label-VP
$\Rightarrow$ comb-2 $\Rightarrow$	{{So}₀}	{what ’s a parent to do}	?	$\Rightarrow$ label-SQ
$\Rightarrow$ comb-0 $\Rightarrow$	{}	{So what ’s a parent to do}	?	$\Rightarrow$ no-label
$\Rightarrow$ sh $\Rightarrow$	{{So what ’s a parent to do}₀}	{?}		$\Rightarrow$ no-label
$\Rightarrow$ comb-0 $\Rightarrow$	{}	{So what ’s a parent to do ?}		$\Rightarrow$ label-SBARQ

Table 3: Results on development corpora. F1 is the F-score on all constituents, Disc. F1 is an F-score computed only on discontinuous constituents, POS is the accuracy on part-of-speech tags. Detailed results (including precision and recall) are given in Table 7 of Appendix C.

	DPTB			Tiger			Negra
	F1	Disc. F1	POS	F1	Disc. F1	POS	F1	Disc. F1	POS
static	91.1	68.2	97.2	87.4	61.7	98.3	83.6	51.3	97.9
dynamic	91.4	70.9	97.2	87.6	62.5	98.4	84.0	54.0	98.0

Table 4: Discontinuous parsing results on the test sets. ∗ Neural scoring system. † Does not discount root symbols and punctuation.

Predicted POS tags or own tagging
	English (DPTB)		German (Tiger)		German (Negra)
Model	F	Disc. F	F	Disc. F	F	Disc. F
This work, dynamic oracle	90.9	67.3	82.5	55.9	83.2	56.3
Coavoux et al. (2019)∗, gap, bi-LSTM	91.0	71.3	82.7	55.9	83.2	54.6
Stanojević and Garrido Alhama (2017)∗, swap, stack/tree-LSTM			77.0
Coavoux and Crabbé (2017a), sr-gap, perceptron			79.3
Versley (2016), pseudo-projective, chart-based			79.5
Corro et al. (2017)∗, bi-LSTM, Maximum Spanning Arborescence	89.2
van Cranenburgh et al. (2016), DOP, $\leq 40$	87.0				74.8
Fernández-González and Martins (2015), dependency-based			77.3
Gebhardt (2018), LCFRS with latent annotations			75.1
Gold POS tags
Stanojević and Garrido Alhama (2017)∗, swap, stack/tree-LSTM			81.6		82.9
Coavoux and Crabbé (2017a), sr-gap, perceptron			81.6	49.2	82.2	50.0
Maier (2015), swap, perceptron			74.7	18.8	77.0	19.8
Corro et al. (2017)∗, bi-LSTM, Maximum Spanning Arborescence	90.1		81.6
Evang and Kallmeyer (2011), PLCFRS, $< 25$	79†

Table 5: Hyperparameters of the model.

Architecture hyperparameters
Dimension of word embeddings	32
Dimension of character embeddings	100
Dimension of character bi-LSTM state	50 for each direction
Dimension of sentence-level bi-LSTM	200 for each direction
Dimension of hidden layers for the action scorer	200
Activation functions	$\tanh$ for all hidden layers
Optimization hyperparameters
Initial learning rate	$l_{0} = 0.01$
Learning rate decay	$l_{t} = \frac{l_{0}}{1 + t \cdot 10^{- 7}}$ for step number $t$
Dropout for tagger input	0.5
Dropout for parser input	0.2
Training epochs	100
Batch size	1 sentence
Optimization algorithm	Averaged SGD (Polyak and Juditsky, 1992; Bottou, 2010)
Word and character embedding initialization	$\mathcal{U}([-0.1, 0.1])$
Other parameters initialization (including LSTMs)	Xavier (Glorot and Bengio, 2010)
Gradient clipping (norm)	100
Dynamic oracle $p$	0.15
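
The learning rate decay row of Table 5 can be written as a one-line function. This is only a sketch of the formula as printed in the table; the function and argument names are illustrative and are not taken from the released code.

```python
def learning_rate(t, l0=0.01, decay=1e-7):
    """Learning rate at step t, following l_t = l_0 / (1 + t * 10^-7) from Table 5."""
    return l0 / (1 + t * decay)


# The rate halves only after ten million steps, so the decay is very slow.
for step in (0, 1_000_000, 10_000_000):
    print(step, learning_rate(step))
```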

Table 6: Parsing times on development sets in tokens per second (tok/s) and sentences per second (sent/s). The parsing times are presented as reported by the authors; they are not comparable across parsers, since the experiments were run on different hardware. Our parser is run on a single core of an Intel i7 CPU.

Parser	Setting	Tiger		DPTB
		tok/s	sent/s	tok/s	sent/s
This work	Python, neural, greedy, CPU	978	64	910	38
MTG Coavoux et al. (2019)	C++, neural, greedy, CPU	1934	126	1887	80
MTG Coavoux and Crabbé (2017a)	C++, perceptron, beam=4, CPU	4700	260
rparse Maier (2015)	Java, perceptron, beam=8, CPU		80
rparse Maier (2015)	Java, perceptron, beam=1, CPU		640
Corro et al. (2017)	C++, neural, CPU				$\approx 7.3$

Table 7: Detailed results. Overall, the positive effect of the dynamic oracle on F-score is explained by its effect on precision.

		All const.			Disc. const.			POS
Development sets		F	P	R	F	P	R	Acc.
English (DPTB)	static	91.1	91.1	91.2	68.2	75.3	62.3	97.2
	dynamic	91.4	91.5	91.3	70.9	76.1	66.4	97.2
German (Tiger)	static	87.4	87.8	87.0	61.7	64.4	59.2	98.3
	dynamic	87.6	88.2	87.0	62.5	68.6	57.3	98.4
German (Negra)	static	83.6	83.8	83.4	51.3	53.3	49.5	97.9
	dynamic	84.0	84.7	83.4	54.0	58.1	50.5	98.0
Test sets		F	P	R	F	P	R	Acc.
English (DPTB)	dynamic	90.9	91.3	90.6	67.3	73.3	62.1	97.6
German (Tiger)	dynamic	82.5	83.5	81.5	55.9	62.4	50.6	98.0
German (Negra)	dynamic	83.2	83.8	82.6	56.3	64.9	49.8	98.0

Equations

$s \preceq s^{\prime} \Leftrightarrow \begin{cases} \max(s) < \max(s^{\prime}), \\ \text{or } \max(s) = \max(s^{\prime}) \text{ and } s \subseteq s^{\prime}. \end{cases}$

$\mathsf{next}(c, t^{*}) = \operatorname*{argmin}_{\preceq} \mathsf{reach}(c, t^{*}).$

$\mathsf{o_{odd}}(c, t^{*}) = \begin{cases} \{\textsc{label-X}\} & \text{if } \exists (X, s_{f}) \in t^{*}, \\ \{\textsc{no-label}\} & \text{otherwise.} \end{cases}$

$\mathsf{o_{even}}(c, t^{*}) = \begin{cases} \{\textsc{comb-}s \mid (s_{f} \cup s) \subseteq s_{g}\} & \text{if } \max(s_{g}) = \max(s_{f}), \\ \{\textsc{comb-}s \mid (s_{f} \cup s) \subseteq s_{g}\} \cup \{\textsc{sh}\} & \text{if } \max(s_{g}) > \max(s_{f}). \end{cases}$

$(\mathbf{h}_{1}^{(1)}, \dots, \mathbf{h}_{n}^{(1)})$

$(\mathbf{h}_{1}^{(2)}, \dots, \mathbf{h}_{n}^{(2)})$

$\overline{s} = \{i \mid \min(s) < i < \max(s) \text{ and } i \notin s\}.$

$\mathbf{r}(s) = [\mathbf{h}_{\min(s)}^{(2)}; \mathbf{h}_{\max(s)}^{(2)}; \mathbf{h}_{\min(\overline{s})}^{(2)}; \mathbf{h}_{\max(\overline{s})}^{(2)}].$

$\mathbf{r}(\{1, 6\})$

$\mathbf{r}(\{1, 5, 6\})$

$M$

$P(\cdot \mid c)$

$P(\cdot \mid s_{f})$

$P(t_{1}^{n} \mid x_{1}^{n}) = \prod_{i=1}^{n} \text{Softmax}(\mathbf{W}^{(t)} \cdot \mathbf{h}_{i}^{(1)} + \mathbf{b}^{(t)})_{t_{i}},$

$L$

$L_{t}$

$L_{p}$

$\text{precision}(\hat{t}, t^{*})$

$\text{recall}(\hat{t}, t^{*})$

$F_{1}(\hat{t}, t^{*})$


Code & Models

Repositories

https://gitlab.com/mcoavoux/discoparset
PyTorch, official


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms

Full text

Discontinuous Constituency Parsing with a Stack-Free Transition System and a Dynamic Oracle

Maximin Coavoux, Naver Labs Europe (work done at the University of Edinburgh)

Shay B. Cohen, ILCC, School of Informatics, University of Edinburgh

Abstract

We introduce a novel transition system for discontinuous constituency parsing. Instead of storing subtrees in a stack (i.e. a data structure with linear-time sequential access), the proposed system uses a set of parsing items with constant-time random access. This change makes it possible to construct any discontinuous constituency tree in exactly $4n-2$ transitions for a sentence of length $n$, whereas existing systems need a quadratic number of transitions to derive some structures. At each parsing step, the parser considers every item in the set to be combined with a focus item and to construct a new constituent in a bottom-up fashion. The parsing strategy is based on the assumption that most syntactic structures can be parsed incrementally and that the set (the memory of the parser) remains reasonably small on average. Moreover, we introduce a dynamic oracle for the new transition system, and present the first experiments in discontinuous constituency parsing using a dynamic oracle. Our parser obtains state-of-the-art results on three English and German discontinuous treebanks.

1 Introduction

Discontinuous constituency trees extend standard constituency trees by allowing crossing branches to represent long distance dependencies, such as the wh-extraction in Figure 1. Discontinuous constituency trees can be seen as derivations of Linear Context-Free Rewriting Systems (LCFRS, Vijay-Shanker et al., 1987), a class of formal grammars more expressive than context-free grammars, which makes them much harder to parse. In particular, exact CKY-style LCFRS parsing has an $\mathcal{O}(n^{3f})$ time complexity, where $f$ is the fan-out of the grammar (Kallmeyer, 2010).

A natural alternative to grammar-based chart parsing is transition-based parsing, which usually relies on fast approximate decoding methods such as greedy search or beam search. Transition-based discontinuous parsers construct discontinuous constituents by reordering terminals with the swap action (Versley, 2014a, b; Maier, 2015; Maier and Lichte, 2016; Stanojević and Garrido Alhama, 2017), or by using a split stack and the gap action to combine two non-adjacent constituents (Coavoux and Crabbé, 2017a; Coavoux et al., 2019). These proposals represent the memory of the parser (i.e. the tree fragments being constructed) with data structures with linear-time sequential access (either a stack, or a stack coupled with a double-ended queue). As a result, these systems need to perform at least $n$ actions to construct a new constituent from two subtrees separated by $n$ intervening subtrees. Our proposal aims at avoiding this cost when constructing discontinuous constituents.

We design a novel transition system in which a discontinuous constituent is constructed in a single step, without the use of reordering actions such as swap. The main innovation is that the memory of the parser is not represented by a stack, as is usual in shift-reduce systems, but by an unordered random-access set. The parser considers every constituent in the current memory to construct a new constituent in a bottom-up fashion, and thus instantly models interactions between parsing items that are not adjacent. As such, we describe a left-to-right parsing model that deviates from the standard stack-buffer setting, a legacy from pushdown automata and classical parsing algorithms for context-free grammars.

Our contributions are summarized as follows:

•

We design a novel transition system for discontinuous constituency parsing, based on a memory represented by a set of items, and that derives any tree in exactly $4n-2$ steps for a sentence of length $n$ ;

•

we introduce the first dynamic oracle for discontinuous constituency parsing;

•

we present an empirical evaluation of the transition system and dynamic oracle on two German and one English discontinuous treebanks.

The code of our parser is released as an open-source project at https://gitlab.com/mcoavoux/discoparset.

2 Set-based Transition System

System overview

We propose to represent the memory of the parser by (i) a set of parsing items and (ii) a single focus item. Figure 2 (lower part) illustrates a configuration in our system. The parser constructs a tree with two main actions: shift the next token to make it the new focus item (shift), or combine any item in the set with the focus item to make a new constituent bottom-up (combine action).

Since the memory is not an ordered data structure, the parser considers equally every pending parsing item, and thus constructs a discontinuous constituent in a single step, thereby making it able to construct any discontinuous tree in $\mathcal{O}(n)$ transitions.
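
The linear bound can be made exact with a short count. The following is a sketch that relies only on the action definitions and the goal configuration given in Table 1:

```latex
\begin{align*}
\#\textsc{shift} &= n
  && \text{(each token becomes the focus item exactly once)} \\
\#\textsc{combine} &= n - 1
  && \text{(each combine merges two items; $n$ singletons end as a single item)} \\
\#\text{structural actions} &= n + (n - 1) = 2n - 1 \\
\#\text{transitions} &= 2\,(2n - 1) = 4n - 2
  && \text{(one labelling action follows each structural action)}
\end{align*}
```

For the 8-token sentence derived in Table 2, this gives $4 \cdot 8 - 2 = 30$ transitions, i.e. the 15 rows of the table with one even and one odd action each.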

The use of an unordered random-access data structure to represent the memory of the parser also leads to a major change for the scoring system (Figure 2). Stack-based systems use a local view of a parsing configuration to extract features and score actions: features only rely on the few topmost elements on the stack and buffer. The score of each transition depends on the totality of this local view. In contrast, we consider equally every item in the set, and therefore rely on a global view of the memory (Section 3). However, we score each possible combination independently: the score of a single combination only depends on the two constituents that are combined, regardless of the rest of the set.

2.1 System Description

Definitions

We first define an instantiated (discontinuous) constituent $(X,s)$ as a nonterminal label $X$ associated with a set of token indexes $s$. We call $\min(s)$ the left-index of the constituent and $\max(s)$ its right-index. For example in Figure 1, the two VPs are respectively represented by (VP, {1, 6}) and (VP, {1, 5, 6}), and they have the same right-index (6) and left-index (1).

A parsing configuration is a quadruple $(S,s_{f},i,C)$ where:

•

$S$ is a set of sets of indexes and represents the memory of the parser;

•

$s_{f}$ is a set of indexes called the focus item, and satisfies $\max(s_{f})=i-1$ ;

•

$i$ is the index of the next token in the buffer;

•

$C$ is a set of instantiated constituents.

Each new constituent is constructed bottom-up from the focus item and another item in the set $S$ .

Transition set

Our proposed transition system is based on the following types of actions:

•

shift constructs a singleton containing the next token in the buffer and assigns it as the new focus item. The former focus item is added to $S$ .

•

combine- $s$ computes the union between the focus item $s_{f}$ and another item $s$ from the set $S$ , to form the new focus item $s\cup s_{f}$ .

•

label-X instantiates a new constituent $(X,s_{f})$ whose yield is the set of indexes in the focus item $s_{f}$ .

•

no-label has no effect; its semantics is that the current focus set is not a constituent.

Following Cross and Huang (2016b), transitions are divided into structural actions (shift, combine-$s$) and labelling actions (label-X, no-label). The parser may only perform a structural action on an even step and a labelling action on an odd step. For our system, this distinction has the crucial advantage of keeping the number of possible actions low at each parsing step, compared to a system that would perform a combine action and a labelling action in a single reduce-$s$-X action. (Footnote 1: In such a case, we would need to score $|S|\times|N|+1$ actions, where $N$ is the set of nonterminals, instead of $|S|+1$ actions for our system.)

Table 1 presents each action as a deduction rule associated with preconditions. In Table 2, we describe how to derive the tree from Figure 1.
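
The deduction rules of Table 1 can be sketched in a few lines of Python and checked against the derivation of Table 2. This is a minimal illustration, not the released implementation: the `Config` class, the function names and the encoding of a derivation as (structural action, label) pairs are all assumptions made for the example. `None` stands for the null focus of the initial configuration, and an integer `k` stands for comb-`k` (combine with the item whose left-index is `k`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    S: frozenset   # memory: a set of items, each a frozenset of token indexes
    focus: object  # focus item s_f (None in the initial configuration)
    i: int         # index of the next token in the buffer
    C: frozenset   # instantiated constituents (label, yield)
    j: int = 0     # number of steps performed so far


def shift(c, n):
    assert c.i < n and c.j % 2 == 0
    S = c.S if c.focus is None else c.S | {c.focus}
    return Config(S, frozenset({c.i}), c.i + 1, c.C, c.j + 1)


def combine(c, s):
    assert s in c.S and c.j % 2 == 0
    return Config(c.S - {s}, c.focus | s, c.i, c.C, c.j + 1)


def label(c, X):
    assert c.j % 2 == 1
    return Config(c.S, c.focus, c.i, c.C | {(X, c.focus)}, c.j + 1)


def no_label(c, n):
    # The last step must label the root: no-label is illegal once all is merged.
    assert c.j % 2 == 1 and (c.i != n or c.S)
    return Config(c.S, c.focus, c.i, c.C, c.j + 1)


def run(derivation, n):
    """Apply (structural, labelling) action pairs from the initial configuration."""
    c = Config(frozenset(), None, 0, frozenset())
    for structural, lab in derivation:
        if structural == "sh":
            c = shift(c, n)
        else:  # comb-k: combine with the item whose left-index is k
            c = combine(c, next(s for s in c.S if min(s) == structural))
        c = label(c, lab) if lab else no_label(c, n)
    return c


# Derivation of Table 2 for "So what 's a parent to do ?" (n = 8).
TABLE2 = [("sh", None), ("sh", "WHNP"), ("sh", None), ("sh", None), ("sh", None),
          (3, "NP"), (2, None), ("sh", None), ("sh", None), (1, "VP"),
          (5, "VP"), (2, "SQ"), (0, None), ("sh", None), (0, "SBARQ")]

final = run(TABLE2, 8)
print(final.j, sorted((X, sorted(s)) for X, s in final.C))
```

The discontinuous constituent (VP, {1, 6}) is built by a single comb-1 action, even though three other items sit in the memory between `what` and `do`.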

2.2 Oracles

Training a transition-based parser requires an oracle, i.e. a function that determines what the best action is in a specific parsing configuration to serve as a training signal. We first describe a static oracle that provides a canonical derivation for a given gold tree. We then introduce a dynamic oracle that determines what the best action is in any parsing configuration.

2.2.1 Static Oracle

Our transition system exhibits a fair amount of spurious ambiguity, the ambiguity exhibited by the existence of many possible derivations for a single tree. Indeed, since we use an unordered memory, an $n$ -ary constituent (and more generally a tree) can be constructed by many different transition sequences. For example, the set {0, 1, 2} might be constructed by combining

•

{0} and {1} first, and the result with {2}; or

•

{1} and {2} first, and the result with {0}; or

•

{0} and {2} first, and the result with {1}.

Following Cohen et al. (2012), we eliminate spurious ambiguity by selecting a canonical derivation for a gold tree. In particular, we design the static oracle (i) to apply combine as soon as possible in order to minimize the size of the memory (ii) to combine preferably with the most recent set in the memory when several combinations are possible. The first choice is motivated by properties of our system: when the memory is smaller, there are fewer choices, therefore decisions are simpler and less expensive to score.

2.2.2 Dynamic Oracle

Parsers are usually trained to predict the gold sequence of actions, using a static oracle. The limitation of this method is that the parser only sees a tiny portion of the search space at train time and only trains on gold input (i.e. configurations obtained after performing gold actions). At test time, it is in a different situation due to error propagation: it must predict what the best actions are in configurations from which the gold tree is probably no longer reachable.

To alleviate this limitation, Goldberg and Nivre (2012) proposed to train a parser with a dynamic oracle, an oracle that is defined for any parsing configuration and outputs the set of best actions to perform. In contrast, a static oracle is deterministic and is only defined for gold configurations.

Dynamic oracles were proposed for a wide range of dependency parsing transition systems (Goldberg and Nivre, 2013; Gómez-Rodríguez et al., 2014; Gómez-Rodríguez and Fernández-González, 2015), and later adapted to constituency parsing (Coavoux and Crabbé, 2016; Cross and Huang, 2016b; Fernández-González and Gómez-Rodríguez, 2018b, a).

In the remainder of this section, we introduce a dynamic oracle for our proposed transition system. It can be seen as an extension of the oracle of Cross and Huang (2016b) to the case of discontinuous parsing.

Preliminary definitions

For a parsing configuration $c$ , the relation $c\vdash c^{\prime}$ holds iff $c^{\prime}$ can be derived from $c$ by a single transition. We note $\vdash^{*}$ the reflexive and transitive closure of $\vdash$ . An instantiated constituent $(X,s)$ is reachable from a configuration $c=(S,s_{f},i,C)$ iff there exists $c^{\prime}=(S^{\prime},s^{\prime}_{f},i^{\prime},C^{\prime})$ such that $(X,s)\in C^{\prime}$ and $c\vdash^{*}c^{\prime}$ . Similarly, a set of constituents $t$ (possibly a full discontinuous constituency tree) is reachable iff there exists a configuration $c^{\prime}=(S^{\prime},s^{\prime}_{f},i^{\prime},C^{\prime})$ such that $t\subseteq C^{\prime}$ and $c\vdash^{*}c^{\prime}$ . We note $\mathsf{reach}(c,t^{*})$ the set of constituents that are (i) in the gold set of constituents $t^{*}$ (ii) reachable from $c$ .

We define a total order $\preceq$ on index sets:

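$s \preceq s^{\prime} \Leftrightarrow \begin{cases} \max(s) < \max(s^{\prime}), \\ \text{or } \max(s) = \max(s^{\prime}) \text{ and } s \subseteq s^{\prime}. \end{cases}$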

This order naturally extends to the constituents of a tree: $(X,s)\preceq(X^{\prime},s^{\prime})$ iff $s\preceq s^{\prime}$ . If $(X,s)$ precedes $(X^{\prime},s^{\prime})$ , then $(X,s)$ must be constructed before $(X^{\prime},s^{\prime})$ . Indeed, since the right-index of the focus item is non-decreasing during a derivation (as per the transition definitions), constituents are constructed in the order of their right-index (first condition). Moreover, since the algorithm is bottom-up, a constituent must be constructed before its parent (second condition).

From a configuration $c=(S,s_{f},i,C)$ at an odd step, a constituent $(X,s_{g})\notin C$ is reachable iff both the following properties hold:

1. $\max(s_{f})\leq\max(s_{g})$;

2. $\forall s\in S\cup\{s_{f}\},(s\subseteq s_{g})\text{ or }(s\cap s_{g}=\emptyset)$.

Condition 1 is necessary because the parser can only construct new constituents $(X,s)$ such that $s_{f}\preceq s$. Condition 2 makes sure that $s_{g}$ can be constructed from a union of elements from $S\cup\{s_{f}\}$, potentially augmented with terminals from the buffer: $\{i,i+1,\dots,\max(s_{g})\}$.
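
The order $\preceq$ and the two reachability conditions translate directly into set operations. The sketch below assumes items are represented as Python `frozenset`s of token indexes; the function names are illustrative. The example configurations use the token indexes of the sentence in Table 2 (`what` = 1, `do` = 6), and the second one is a hypothetical non-gold configuration in which the parser wrongly merged `what` and `'s`.

```python
def precedes(s, t):
    """s ⪯ t: smaller right-index first; for equal right-indexes, subset first."""
    return max(s) < max(t) or (max(s) == max(t) and s <= t)


def reachable_at_odd_step(S, focus, s_g):
    """Conditions 1 and 2 for a gold yield s_g whose constituent is not yet in C."""
    if max(focus) > max(s_g):  # condition 1
        return False
    # Condition 2: every pending item is inside s_g or disjoint from it.
    return all(s <= s_g or not (s & s_g) for s in S | {focus})


f = frozenset
vp_low, vp_high = f({1, 6}), f({1, 5, 6})

# Gold configuration of Table 2 right after shifting "do".
S_gold = {f({0}), f({1}), f({2, 3, 4}), f({5})}
print(reachable_at_odd_step(S_gold, f({6}), vp_low))  # True

# Hypothetical erroneous memory: {1, 2} overlaps the VP only partially.
S_bad = {f({0}), f({1, 2}), f({3, 4}), f({5})}
print(reachable_at_odd_step(S_bad, f({6}), vp_low))   # False
```

In the erroneous configuration the lower VP is lost, but the SQ constituent spanning indexes 1 to 6 is still reachable, which is the kind of information a dynamic oracle exploits.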

Following Cross and Huang (2016b), we define $\mathsf{next}(c,t^{*})$ as the smallest reachable gold constituent from a configuration $c$ . Formally:

$\mathsf{next}(c, t^{*}) = \operatorname*{argmin}_{\preceq} \mathsf{reach}(c, t^{*}).$

Oracle algorithm

We first define the oracle $\mathsf{o}$ for the odd step of a configuration $c=(S,s_{f},i,C)$ :

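$\mathsf{o_{odd}}(c, t^{*}) = \begin{cases} \{\textsc{label-X}\} & \text{if } \exists (X, s_{f}) \in t^{*}, \\ \{\textsc{no-label}\} & \text{otherwise.} \end{cases}$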

For even steps, assuming $\mathsf{next}(c,t^{*})=(X,s_{g})$ , we define the oracle as follows:

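$\mathsf{o_{even}}(c, t^{*}) = \begin{cases} \{\textsc{comb-}s \mid (s_{f} \cup s) \subseteq s_{g}\} & \text{if } \max(s_{g}) = \max(s_{f}), \\ \{\textsc{comb-}s \mid (s_{f} \cup s) \subseteq s_{g}\} \cup \{\textsc{sh}\} & \text{if } \max(s_{g}) > \max(s_{f}). \end{cases}$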

We provide a proof of the correctness of the oracle in Appendix A.
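
The oracle can be sketched as follows. This is one possible reading of the definitions above, not the authors' code, and it makes three assumptions that the text does not state explicitly: (i) $\mathsf{next}$ is taken over gold constituents that are not yet in $C$; (ii) at an even step a gold yield equal to the current focus is treated as no longer reachable, since the focus cannot be labelled before the next structural action changes it; (iii) the initial configuration, whose focus is null, only allows a shift. Actions are encoded as illustrative tuples such as `("comb", s)` or `("sh",)`.

```python
def precedes(s, t):
    return max(s) < max(t) or (max(s) == max(t) and s <= t)


def reachable(S, focus, s_g, odd):
    ok = max(focus) <= max(s_g) and all(
        s <= s_g or not (s & s_g) for s in S | {focus})
    # Assumption (ii): at an even step, the focus itself can no longer be labelled.
    return ok and (odd or s_g != focus)


def next_gold(S, focus, C, gold, odd):
    """Smallest (w.r.t. ⪯) reachable gold constituent not yet constructed."""
    best = None
    for X, s in gold:
        if (X, s) in C or not reachable(S, focus, s, odd):
            continue
        if best is None or precedes(s, best[1]):
            best = (X, s)
    return best


def oracle_odd(focus, gold):
    labels = {("label", X) for X, s in gold if s == focus}
    return labels or {("no-label",)}


def oracle_even(S, focus, C, gold):
    if focus is None:  # assumption (iii): initial configuration
        return {("sh",)}
    nxt = next_gold(S, focus, C, gold, odd=False)
    if nxt is None:
        return set()
    _, s_g = nxt
    actions = {("comb", s) for s in S if (focus | s) <= s_g}
    if max(s_g) > max(focus):
        actions.add(("sh",))
    return actions


f = frozenset
GOLD = {("WHNP", f({1})), ("NP", f({3, 4})), ("VP", f({1, 6})),
        ("VP", f({1, 5, 6})), ("SQ", f(range(1, 7))), ("SBARQ", f(range(8)))}
built = {("WHNP", f({1})), ("NP", f({3, 4}))}

# Table 2, right after shifting "do": the oracle proposes comb-1 only.
S = {f({0}), f({1}), f({2, 3, 4}), f({5})}
print(oracle_even(S, f({6}), built, GOLD))
```

On gold configurations this reproduces the structural actions of Table 2; on configurations reached after an error it still returns actions that build the smallest gold constituent that remains reachable.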

3 A Neural Network based on Constituent Boundaries

We first present an encoder that computes context-aware representations of tokens (Section 3.1). We then discuss how to compute the representation of a set of tokens (Section 3.2). We describe the action scorer (Section 3.3), the POS tagging component (Section 3.4), and the objective function (Section 3.5).

3.1 Token Representations

As in recent proposals in dependency and constituency parsing (Cross and Huang, 2016a; Kiperwasser and Goldberg, 2016), our scoring system is based on a sentence transducer that constructs a context-aware representation for each token.

Given a sequence of tokens $x_{1}^{n}=(x_{1},\dots,x_{n})$ , we first run a single-layer character bi-LSTM encoder $\mathbf{c}$ to obtain a character-aware embedding $\mathbf{c}(x_{i})$ for each token. We represent a token $x_{i}$ as the concatenation of a standard word embedding $\mathbf{e}(x_{i})$ and the character-aware embedding: $\mathbf{w}_{x_{i}}=[\mathbf{c}(x_{i});\mathbf{e}(x_{i})].$

Then, we run a 2-layer bi-LSTM transducer over the sequence of token representations:

[TABLE]

The parser uses the context-aware token representations $\mathbf{h}_{i}^{(2)}$ to construct vector representations of sets or constituents.

3.2 Set Representations

An open issue in neural discontinuous parsing is the representation of discontinuous constituents. In projective constituency parsing, it has become standard to use the boundaries of constituents (Hall et al., 2014; Crabbé, 2015; Durrett and Klein, 2015), an approach that proved very successful with bi-LSTM token representations (Cross and Huang, 2016b; Stern et al., 2017).

Although constituent boundary features improve discontinuous parsing (Coavoux and Crabbé, 2017a), relying only on the left-index and the right-index of a constituent has the limitation of ignoring gaps inside a constituent. For example, since the two VPs in Figure 1 have the same right-index and left-index, they would have the same representations. It may also happen that constituents with identical right-index and left-index do not have the same labels.

We represent a (possibly partial) constituent with the yield $s$ , by computing 4 indexes from $s$ : $(\min(s),\max(s),\min(\overline{s}),\max(\overline{s}))$ . The set $\overline{s}$ represents the gap in $s$ , i.e. the tokens between $\min(s)$ and $\max(s)$ that are not in the yield of $s$ :

$\overline{s} = \{i \mid \min(s) < i < \max(s) \text{ and } i \notin s\}.$

Finally, we extract the corresponding token representations of the 4 indexes and concatenate them to form the vector representation $\mathbf{r}(s)$ of $s$ :

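$\mathbf{r}(s) = [\mathbf{h}_{\min(s)}^{(2)}; \mathbf{h}_{\max(s)}^{(2)}; \mathbf{h}_{\min(\overline{s})}^{(2)}; \mathbf{h}_{\max(\overline{s})}^{(2)}].$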

For an index set that does not contain a gap, we have $\overline{s}=\emptyset$ . To handle this case, we use a parameter vector $\mathbf{h}_{\mathsf{nil}}$ , randomly initialized and learned jointly with the network, to embed $\max(\emptyset)=\min(\emptyset)=\mathsf{nil}$ .

For example, the constituents (VP, {1, 6}) and (VP, {1, 5, 6}) will be respectively vectorized as:

[TABLE]

This representation method makes sure that two distinct index sets have distinct representations, as long as they have at most one gap each. This property no longer holds if an index set has more than one gap.
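
The four indexes that select the token vectors can be computed with plain set operations. In this sketch, `None` plays the role of $\mathsf{nil}$ (the index that would be embedded by $\mathbf{h}_{\mathsf{nil}}$), and the function names are illustrative; the sets {0, 2, 4} and {0, 4} are hypothetical examples chosen to show the collision described above.

```python
def gap(s):
    """Indexes strictly between min(s) and max(s) that are not in s."""
    return {i for i in range(min(s) + 1, max(s)) if i not in s}


def boundary_indexes(s):
    """(min(s), max(s), min(gap), max(gap)), with None standing for nil."""
    g = gap(s)
    return (min(s), max(s), min(g) if g else None, max(g) if g else None)


print(boundary_indexes({1, 6}))     # the two VPs of Figure 1 now differ
print(boundary_indexes({1, 5, 6}))
print(boundary_indexes({3, 4}))     # continuous item: both gap indexes are nil

# With two gaps the signature is no longer unique:
print(boundary_indexes({0, 2, 4}) == boundary_indexes({0, 4}))
```

The last line illustrates the limitation: {0, 2, 4} has two gaps, and its four indexes coincide with those of {0, 4}.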

3.3 Action Scorer

For each type of action (structural or labelling), we use a feedforward network with two hidden layers.

Structural actions

At structural steps, for a configuration $c=(S,s_{f},i,C)$ , we need to compute the score of $|S|$ combine actions and possibly a shift action. In our approach, the score of a combine- $s$ action only depends on $s$ and $s_{f}$ and is independent of the rest of the configuration (i.e. other items in the set). We first construct input matrix $M$ as follows:

[TABLE]

Each of the first $n$ columns of matrix $M$ represents the input for a combine action, whereas the last column is the input for the shift action. We then compute the score of each structural action:

[TABLE]

where $\text{FF}_{s}$ is a feedforward network with two hidden layers, a $\tanh$ activation and a single output unit. In other words, it outputs a single scalar for each column vector of matrix $M$ . This part of the network can be seen as an attention mechanism, where the focus item is the query, and the context is formed by the items in the set and the first element in the buffer.

Labelling actions

We compute the probabilities of labelling actions as follows:

[TABLE]

where $\text{FF}_{l}$ is a feedforward network with two hidden layers activated with the $\tanh$ function, and $|N|+1$ output units, where $N$ is the set of nonterminals.

3.4 POS Tagger

Following Coavoux and Crabbé (2017b), we use the first layer of the bi-LSTM transducer as input to a Part-of-Speech (POS) tagger that is learned jointly with the parser. For a sentence $x_{1}^{n}$ , we compute the probability of a sequence of POS tags $t_{1}^{n}=(t_{1},\dots,t_{n})$ as follows:

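$P(t_{1}^{n} \mid x_{1}^{n}) = \prod_{i=1}^{n} \text{Softmax}(\mathbf{W}^{(t)} \cdot \mathbf{h}_{i}^{(1)} + \mathbf{b}^{(t)})_{t_{i}},$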

where $\mathbf{W}^{(t)}$ and $\mathbf{b}^{(t)}$ are parameters.

3.5 Objective Function

In the static oracle setting, for a single sentence $x_{1}^{n}$ , we optimize the sum of the log-likelihood of gold POS-tags $t_{1}^{n}$ and the log-likelihood of gold parsing actions $a_{1}^{n}$ :

[TABLE]

We optimize this objective by alternating a stochastic step for the tagging objective and a stochastic step for the parsing objective, as is standard in multitask learning (Caruana, 1997).

In the dynamic oracle setting, instead of optimizing the likelihood of the gold actions (assuming all previous actions were gold), we optimize the likelihood of the best actions, as computed by the dynamic oracle, from a configuration sampled from the space of all possible configurations. In practice, before each epoch, we sample each sentence from the training corpus with probability $p$ and we use the current (non-averaged) parameters to parse the sentence and generate a sequence of configurations. Instead of selecting the highest-scoring action at each parsing step, as in a normal inference step, we sample an action using the softmax distribution computed by the parser, as done by Ballesteros et al. (2016). Then, we use the dynamic oracle to calculate the best action from each of these configurations. In case there are several best actions, we deterministically choose a single action by favoring a combine over a shift (to bias the model towards a small memory), and to combine with the item with the highest right-index (to avoid spurious discontinuity in partial constituents). We train the parser on these sequences of potentially non-gold configuration-action pairs.
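
The three ingredients of this procedure (sampling sentences with probability $p$, sampling actions from the softmax distribution, and breaking ties among the oracle's best structural actions) can be sketched with the standard library alone. The function names and the tuple encoding of actions (`("comb", s)`, `("sh",)`) are assumptions made for illustration; in the actual parser the scores come from the neural network.

```python
import math
import random


def sample_sentences(corpus, p, rng):
    """Keep each training sentence with probability p (p = 0.15 in Table 5)."""
    return [sent for sent in corpus if rng.random() < p]


def sample_action(scores, rng):
    """Sample an action index from the softmax of raw scores."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def pick_training_action(best_actions):
    """Tie-break among best structural actions: prefer a combine over a shift,
    and among combines the item with the highest right-index."""
    combines = [a for a in best_actions if a[0] == "comb"]
    if combines:
        return max(combines, key=lambda a: max(a[1]))
    return ("sh",)


rng = random.Random(0)
best = {("sh",), ("comb", frozenset({1})), ("comb", frozenset({5}))}
print(pick_training_action(best))
print(sample_action([0.1, 2.0, -1.0], rng))
```

Because the items in the memory are disjoint, their right-indexes are all different, so the tie-break always selects a unique combine action.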

4 Experiments

We carried out experiments to assess the adequacy of our system and the effect of training with the dynamic oracle. We present the three discontinuous constituency treebanks that we used (Section 4.1), our experimental protocol (Section 4.2), then we discuss the results (Section 4.3) and the efficiency of the parser (Section 4.4).

4.1 Datasets

We perform experiments on three discontinuous constituency corpora. The discontinuous Penn Treebank was introduced by Evang and Kallmeyer (2011), who converted the long distance dependencies encoded by indexed traces in the original Penn Treebank (Marcus et al., 1993) to discontinuous constituents. We used the standard split (sections 2-21 for training, 22 for development and 23 for test). The Tiger corpus (Brants et al., 2004) and the Negra corpus (Skut et al., 1997) are both German treebanks natively annotated with discontinuous constituents. We used the SPMRL split for the Tiger corpus (Seddah et al., 2013), and the split of Dubey and Keller (2003) for the Negra corpus.

4.2 Implementation and Protocol

We implemented our parser in Python using the PyTorch library Paszke et al. (2017). We trained each model with the ASGD algorithm Polyak and Juditsky (1992) for 100 epochs. Training a single model takes approximately a week on a GPU. We evaluate a model every 4 epochs on the validation set and select the best performing model according to the validation F-score. We refer the reader to Table 5 of Appendix B for the full list of hyperparameters.

We evaluate models with the dedicated module of discodop (https://github.com/andreasvc/disco-dop) van Cranenburgh et al. (2016). We use the standard evaluation parameters (proper.prm), which ignore punctuation and root symbols. We report two evaluation metrics: a standard F-score (F) and an F-score computed only on discontinuous constituents (Disc. F), which provides a more qualitative evaluation of the ability of the parser to recover long-distance dependencies.
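To make the two metrics concrete, here is a simplified sketch, assuming a tree is represented as a set of `(label, frozenset_of_token_indices)` constituents. It is not the discodop implementation: it ignores the punctuation and root filtering done by proper.prm and scores a single tree rather than aggregating counts over a corpus. The function names are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Constituent = Tuple[str, FrozenSet[int]]


def f_score(pred: Set[Constituent], gold: Set[Constituent]) -> float:
    """Harmonic mean of precision and recall over labelled constituents."""
    correct = len(pred & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def is_discontinuous(span: FrozenSet[int]) -> bool:
    """A span is discontinuous if its token indices leave a gap."""
    return max(span) - min(span) + 1 != len(span)


def disc_f_score(pred: Set[Constituent], gold: Set[Constituent]) -> float:
    """F-score restricted to discontinuous constituents on both sides."""
    return f_score(
        {c for c in pred if is_discontinuous(c[1])},
        {c for c in gold if is_discontinuous(c[1])},
    )
```

A parser can score well on F while scoring poorly on Disc. F, because discontinuous constituents are only a small fraction of all constituents.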

4.3 Results

Effect of Dynamic Oracle

We present parsing results on the development sets of each corpus in Table 3. The effect of the oracle is in line with other published results in projective constituency parsing Coavoux and Crabbé (2016); Cross and Huang (2016b) and dependency parsing Goldberg and Nivre (2012); Gómez-Rodríguez et al. (2014): the dynamic oracle improves the generalization capability of the parser.

External comparisons

In Table 4, we compare our parser to other transition-based parsers Maier (2015); Coavoux and Crabbé (2017a); Stanojević and Garrido Alhama (2017); Coavoux et al. (2019), the pseudo-projective parser of Versley (2016), grammar-based chart parsers Evang and Kallmeyer (2011); van Cranenburgh et al. (2016); Gebhardt (2018) and parsers based on dependency parsing Fernández-González and Martins (2015); Corro et al. (2017). Note that some of them only report results in a gold POS tag setting (the parser has access to gold POS tags and uses them as features), a setting that is much easier than ours.

Our parser matches the state of the art of Coavoux et al. (2019). This promising result shows that it is feasible to design accurate transition systems without an ordered memory.

4.4 Efficiency

Our transition system derives a tree for a sentence of $n$ words in exactly $4n-2$ transitions. Indeed, there must be $n$ shift actions, and $n-1$ combine actions. Each of these $2n-1$ transitions must be followed by a single labelling action.

The statistical model responsible for choosing which action to perform at each parsing step needs to score $|S|+1$ actions for a structural step and $|N|+1$ actions for a labelling step (where $N$ is the set of possible nonterminals). Since in the worst case $S$ contains $n-1$ singletons, each of the $2n-1$ structural steps scores $\mathcal{O}(n)$ actions and each of the $2n-1$ labelling steps scores $\mathcal{O}(|N|)$ actions, which gives the parser an $\mathcal{O}(n(|N|+n))$ time complexity.
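The counting argument above can be checked with a short sketch. The function names are hypothetical, and the second function is a deliberately loose upper bound that assumes every structural step faces the worst-case memory of $n-1$ items.

```python
def num_transitions(n: int) -> int:
    """Number of transitions needed to derive a tree over n words."""
    shifts = n
    combines = n - 1
    structural = shifts + combines  # 2n - 1
    labelling = structural          # one labelling action per structural action
    return structural + labelling   # 4n - 2


def worst_case_scored_actions(n: int, num_nonterminals: int) -> int:
    """Upper bound on the number of action scores computed for one sentence.

    Assumes |S| <= n - 1 at every structural step, so at most
    (n - 1) + 1 = n candidate actions, and |N| + 1 candidates at
    every labelling step.
    """
    steps = 2 * n - 1
    return steps * n + steps * (num_nonterminals + 1)
```

For a 5-word sentence, `num_transitions(5)` gives 18, that is $4 \times 5 - 2$. The bound from `worst_case_scored_actions` grows as $n(|N|+n)$, matching the stated complexity.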

In practice, the memory of the parser $S$ remains relatively small on average. We report in Figure 3 the distribution of the size of $S$ across configurations when parsing the development sets of the three corpora. For the German treebanks, the memory contains 7 or fewer elements for more than 99 percent of configurations. For the Penn Treebank, the memory is slightly larger, with 98 percent of configurations containing 11 or fewer items.

We report empirical runtimes in Table 6 of Appendix C. Our parser compares decently with other transition-based parsers, despite being written in Python.

5 Related Work

Existing transition systems for discontinuous constituency parsing rely on three main strategies for constructing discontinuous constituents: a swap-based strategy, a split-stack strategy, and the use of non-local transitions.

Swap-based systems

Swap-based transition systems are based on the idea that any discontinuous constituency tree can be transformed into a projective tree by reordering terminals. They reorder terminals by swapping them with a dedicated action (swap), commonly used in dependency parsing Nivre (2009). The first proposals in transition-based discontinuous constituency parsing used the swap action on top of an easy-first parser Versley (2014a, b). Subsequent proposals relied on a shift-reduce system Maier (2015); Maier and Lichte (2016) or a shift-promote-adjoin system Stanojević and Garrido Alhama (2017).

The main limitation of swap-based systems is that they tend to require a large number of transitions to derive certain trees. The choice of an oracle that minimizes derivation lengths has a substantially positive effect on parsing Maier and Lichte (2016); Stanojević and Garrido Alhama (2017).

Split-stack systems

The second parsing strategy constructs discontinuous constituents by allowing the parser to reduce pairs of items that are not adjacent in the stack. In practice, Coavoux and Crabbé (2017a) split the usual stack of shift-reduce parsers into two data structures (a stack and a double-ended queue), in order to give the parser access to two focus items: the respective tops of the stack and the deque, which may or may not be adjacent. A dedicated action, gap, pushes the top of the stack onto the bottom of the deque to make the next item in the stack available for a reduction.

The split stack associated with the gap action can be interpreted as a linear-access memory: it is possible to access the $i^{\text{th}}$ element in the stack, but it requires $i$ operations.
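The linear-access behaviour can be illustrated with a toy sketch. It assumes the stack is a Python list whose last element is the top, and the deque is a `collections.deque` whose right end is the top and left end is the bottom. The helper names `gap` and `expose` are hypothetical, and the sketch omits the rest of the split-stack system (reductions, buffer, labels).

```python
from collections import deque
from typing import Deque, List


def gap(stack: List[str], dq: Deque[str]) -> None:
    """Move the top of the stack to the bottom of the deque."""
    dq.appendleft(stack.pop())


def expose(stack: List[str], dq: Deque[str], i: int) -> int:
    """Make the item i positions below the stack top the new top.

    Returns the number of gap actions used, which is i: access cost
    grows linearly with the depth of the item.
    """
    for _ in range(i):
        gap(stack, dq)
    return i


stack = ["a", "b", "c", "d"]  # "d" is the top of the stack
dq = deque(["e"])             # "e" is the top of the deque
cost = expose(stack, dq, 2)   # two gap actions are needed to reach "b"
```

After the two gap actions, "b" is on top of the stack while "e" is still on top of the deque, so the two non-adjacent items become the focus pair. In a set-based memory, by contrast, any item can be selected in a single step.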

Non-local transitions

Non-local transitions generalize standard parsing actions to non-adjacent elements in the parsing configurations. Maier and Lichte (2016) introduced a non-local transition SkipShift- $i$ which applies shift to the $i^{\text{th}}$ element in the buffer. Non-local transitions are also widely used in non-projective dependency parsing Attardi (2006); Qi and Manning (2017); Fernández-González and Gómez-Rodríguez (2018).

The key difference between these systems and ours is that we use an unordered memory. As a result, the semantics of the combine- $s$ action we introduce in Section 2 is independent of any specific position in the stack or the buffer. A system with an action such as SkipShift- $i$ needs to learn parameters for every possible $i$ , and will only learn parameters for the SkipShift- $i$ actions that are required to derive the training set. In contrast, we use the same parameters to score each possible combine- $s$ action.

6 Conclusion

We have presented a novel transition system that dispenses with the use of a stack, i.e. a memory with linear sequential access. Instead, the memory of the parser is represented by an unordered data structure with random access: a set. We have designed a dynamic oracle for the resulting system and shown their empirical potential with state-of-the-art results on discontinuous constituency parsing of one English and two German treebanks. In future work, we plan to adapt our system to non-projective dependency parsing and semantic graph parsing.

Acknowledgments

We thank Caio Corro, Giorgio Satta, Marco Damonte, as well as NAACL anonymous reviewers for feedback and suggestions. We gratefully acknowledge the support of Huawei Technologies.

Appendix A Oracle Correctness

The oracle $\mathsf{o}$ leads to the reachable tree with the highest F-score with respect to the gold tree. The F-score of a predicted tree $\hat{t}$ (represented as a set of instantiated constituents) with respect to a gold tree $t^{*}$ is defined as:

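$F(\hat{t},t^{*})=\dfrac{2\,P\,R}{P+R}$ , where $P=\dfrac{|\hat{t}\cap t^{*}|}{|\hat{t}|}$ is the precision and $R=\dfrac{|\hat{t}\cap t^{*}|}{|t^{*}|}$ is the recall.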

By definition, $\mathsf{o_{odd}}$ is optimal for precision because it constructs a constituent only if it is gold, and optimal for recall because it will construct a gold constituent if it is possible to do so.

Moreover, $\mathsf{o_{even}}$ is optimal for recall because any gold constituent reachable from $c$ will still be reachable after any transition in $\mathsf{o_{even}}(c,t^{*})$ . Assuming a configuration $c=(S,s_{f},i,C)$ and $\mathsf{next}(c,t^{*})=s_{g}$ , we consider separately the shift case and the combine- $s$ case:

• shift case ( $\max(s_{g})>\max(s_{f})$ ): constituents $(X,s)$ reachable from $c$ and not reachable from shift( $c$ ) satisfy $\max(s)=i$ . If a gold constituent satisfies this property, we have $s\preceq s_{g}$ , which contradicts the assumption that $s_{g}=\mathsf{next}(c,t^{*})$ (see definition of oracle in Section 2.2.2).

• combine- $s$ case: Let $(X,s^{\prime})$ be a reachable gold constituent. Since it is compatible with $s_{g}$ , there are three possible cases:

  – if $(X,s^{\prime})$ is an ascendant of $s_{g}$ , then $(s\cup s_{f})\subseteq s_{g}\subset s^{\prime}$ , therefore $(X,s^{\prime})$ is still reachable from combine- $s$ ( $c$ ).

  – if $(X,s^{\prime})$ is a descendant of $s_{g}$ then $s^{\prime}\preceq s_{g}$ , which contradicts the definition of $s_{g}$ .

  – if $s^{\prime}$ and $s_{g}$ are completely disjoint, we have $s^{\prime}\cap s=s^{\prime}\cap s_{f}=\emptyset$ , therefore $s^{\prime}\cap(s\cup s_{f})=\emptyset$ , and $s^{\prime}$ is still reachable from combine- $s$ ( $c$ ).

Finally, since $\mathsf{o_{even}}$ does not construct new constituents (it is the role of labelling actions), it is optimal for precision.

Appendix B Hyperparameters

The list of hyperparameters is presented in Table 5.

• We use learning rate warm-up (linear increase from 0 to $t_{1000}$ during the first 1000 steps).

• Before the $t^{\text{th}}$ update, we add Gaussian noise to the gradient of every parameter with mean 0 and variance $\dfrac{0.01}{(1+t)^{0.55}}$ Neelakantan et al. (2015).

• All experiments use greedy search decoding (we did not experiment with beam search).

• Before each training step, we replace a word embedding by an ‘UNK’ pseudo-word embedding with probability $0.3$ . We only do this replacement for the least frequent word-types ( $\frac{2}{3}$ least frequent word-types). The ‘UNK’ embedding is then used to represent unknown words.

• We apply dropout at the input of the tagger and the input of action scorers: each single prediction has its own dropout mask.
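Two of the items above, the gradient noise schedule and the ‘UNK’ replacement, can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the replacement is shown on word strings rather than embeddings, and breaking frequency ties alphabetically when selecting the $\frac{2}{3}$ least frequent word types is an arbitrary choice made here.

```python
import random
from collections import Counter
from typing import List, Set


def gradient_noise_variance(t: int) -> float:
    """Variance of the Gaussian gradient noise added before update t."""
    return 0.01 / (1 + t) ** 0.55


def rare_word_types(counts: Counter) -> Set[str]:
    """The 2/3 least frequent word types (ties broken alphabetically)."""
    ordered = sorted(counts, key=lambda w: (counts[w], w))
    return set(ordered[: (2 * len(ordered)) // 3])


def replace_with_unk(tokens: List[str], rare: Set[str],
                     rng: random.Random, p: float = 0.3) -> List[str]:
    """Replace each rare token by 'UNK' with probability p."""
    return ["UNK" if (w in rare and rng.random() < p) else w for w in tokens]
```

The noise variance starts at 0.01 and decays towards 0 as training proceeds. Frequent word types are never replaced, so only rare words contribute to training the ‘UNK’ embedding.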

Appendix C Detailed Results

Bibliography


Attardi (2006) Giuseppe Attardi. 2006. Experiments with a multilanguage non-projective dependency parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 166–170. Association for Computational Linguistics.
Ballesteros et al. (2016) Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack LSTM parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2005–2010, Austin, Texas. Association for Computational Linguistics.
Bottou (2010) Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), pages 177–187, Paris, France. Springer.
Brants et al. (2004) Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. Tiger: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620.
Caruana (1997) Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
Coavoux and Crabbé (2016) Maximin Coavoux and Benoit Crabbé. 2016. Neural greedy constituent parsing with dynamic oracles. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 172–182, Berlin, Germany. Association for Computational Linguistics.
Coavoux and Crabbé (2017a) Maximin Coavoux and Benoit Crabbé. 2017a. Incremental discontinuous phrase structure parsing with the gap transition. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1259–1270, Valencia, Spain. Association for Computational Linguistics.
Coavoux and Crabbé (2017b) Maximin Coavoux and Benoit Crabbé. 2017b. Multilingual lexicalized constituency parsing with word-level auxiliary tasks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 331–336, Valencia, Spain. Association for Computational Linguistics.