Sequence Labeling Parsing by Learning Across Representations

Michalina Strzyz; David Vilares; Carlos Gómez-Rodríguez

arXiv:1907.01339·cs.CL·January 8, 2020

TL;DR

This paper presents a multitask learning approach that uses sequence labeling to jointly learn constituency and dependency parsing, improving performance with minimal additional cost.

Contribution

It introduces a unified sequence labeling framework for both parsing paradigms and demonstrates that auxiliary tasks enhance parsing accuracy.

Findings

1. MTL models outperform single-task models in parsing accuracy
2. Auxiliary losses improve constituency parsing by 1.14 F1 points
3. Auxiliary losses improve dependency parsing by 0.62 UAS points

Sequence Labeling Parsing by Learning Across Representations

Michalina Strzyz David Vilares Carlos Gómez-Rodríguez

Universidade da Coruña, CITIC

FASTPARSE Lab, LyS Research Group, Departamento de Computación

Campus de Elviña, s/n, 15071 A Coruña, Spain

{michalina.strzyz,david.vilares,carlos.gomez}@udc.es

Abstract

We use parsing as sequence labeling as a common framework to learn across constituency and dependency syntactic abstractions. To do so, we cast the problem as multitask learning (MTL). First, we show that adding a parsing paradigm as an auxiliary loss consistently improves the performance on the other paradigm. Secondly, we explore an MTL sequence labeling model that parses both representations, at almost no cost in terms of performance and speed. The results across the board show that on average MTL models with auxiliary losses for constituency parsing outperform single-task ones by 1.14 F1 points, and for dependency parsing by 0.62 UAS points. (Footnote 1: This is a revision of https://arxiv.org/abs/1907.01339v2. The previous version contained a bug where the EVALB scripts were not considering the COLLINS.prm and spmrl.prm parameter files.)

1 Introduction

Constituency (Chomsky, 1956) and dependency grammars (Mel’cuk, 1988; Kübler et al., 2009) are the two main abstractions for representing the syntactic structure of a given sentence, and each of them has its own particularities (Kahane and Mazziotta, 2015). While in constituency parsing the structure of sentences is abstracted as a phrase-structure tree (see Figure 1(a)), in dependency parsing the tree encodes binary syntactic relations between pairs of words (see Figure 1(b)).

When it comes to developing natural language processing (NLP) parsers, these two tasks are usually treated as disjoint, and improvements have therefore been obtained separately (Charniak, 2000; Nivre, 2003; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017; Ma et al., 2018; Kitaev and Klein, 2018).

Despite the potential benefits of learning across representations, there have been few attempts in the literature to do this. Klein and Manning (2002) considered a factored model that provides separate methods for phrase-structure and lexical dependency trees and combined them to obtain optimal parses. With a similar aim, Ren et al. (2013) first compute the n best constituency trees using a probabilistic context-free grammar, convert those into dependency trees using a dependency model, compute a probability score for each of them, and finally rerank the most plausible trees based on both scores. However, these methods are complex and intended for statistical parsers. Instead, we propose an extremely simple framework to learn across constituency and dependency representations.

Contribution

(i) We use sequence labeling for constituency (Gómez-Rodríguez and Vilares, 2018) and dependency parsing (Strzyz et al., 2019) combined with multi-task learning (MTL) (Caruana, 1997) to learn across syntactic representations. To do so, we take a parsing paradigm (constituency or dependency parsing) as an auxiliary task to help train a model for the other parsing representation, a simple technique that translates into consistent improvements across the board. (ii) We also show that a single MTL model following this strategy can robustly produce both constituency and dependency trees, obtaining a performance and speed comparable to previous sequence labeling models for (either) constituency or dependency parsing. The source code is available at https://github.com/mstrise/seq2label-crossrep

2 Parsing as Sequence Labeling

Notation

We use $w=[w_{1},...,w_{|w|}]$ to denote an input sentence. We use bold-style lower-cased and math-style upper-cased characters to refer to vectors and matrices, respectively (e.g. $\mathbf{x}$ and $\mathbf{W}$).

Sequence labeling is a structured prediction task where each token in the input sentence is mapped to a label (Rei and Søgaard, 2018). Many NLP tasks suit this setup, including part-of-speech tagging, named-entity recognition or chunking (Tjong Kim Sang and Buchholz, 2000; Toutanova and Manning, 2000; Tjong Kim Sang and De Meulder, 2003). More recently, syntactic tasks such as constituency parsing and dependency parsing have been successfully reduced to sequence labeling (Spoustová and Spousta, 2010; Li et al., 2018; Gómez-Rodríguez and Vilares, 2018; Strzyz et al., 2019). Such models compute a tree representation of an input sentence using $|w|$ tagging actions.

We also cast parsing as sequence labeling, and then learn across representations using multi-task learning. This approach has two main advantages: (i) it requires neither an explicit parsing algorithm nor explicit parsing structures, and (ii) it massively simplifies joint syntactic modeling. We now describe parsing as sequence labeling and the architecture used in this work.

Constituency parsing as tagging

Gómez-Rodríguez and Vilares (2018) define a linearization method $\Phi_{|w|}:T_{c,|w|}\rightarrow L_{c}^{|w|}$ to transform a phrase-structure tree into a discrete sequence of labels of the same length as the input sentence. Each label $l_{i}\in L_{c}$ is a 3-tuple $(n_{i},c_{i},u_{i})$ where: $n_{i}$ is an integer that encodes the number of ancestors in the tree shared between a word $w_{i}$ and the next word $w_{i+1}$ (computed as the relative variation with respect to $n_{i-1}$), $c_{i}$ is the non-terminal symbol at the lowest level shared by that pair of words, and $u_{i}$ (optional) is a leaf unary chain that connects $c_{i}$ to $w_{i}$. Figure 1(a) illustrates the encoding with an example. (Footnote 2: In this work we do not use the dual encoding by Vilares et al. (2019), which combines the relative encoding with a top-down absolute scale to represent certain relations.)
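As an illustration of the relative encoding described above, the following is a minimal Python sketch, not the authors' implementation. It makes several simplifying assumptions that are not in the paper: trees are nested `(label, children)` tuples with plain strings as words, PoS preterminals and unary chains are left out (so $u_{i}$ is omitted), and the last word receives a placeholder `None` label. The function names and the toy sentence are hypothetical.

```python
def leaves_with_ancestors(node, path=(), ancestors=()):
    """Yield (word, ancestors) pairs; ancestors are listed from the root down."""
    label, children = node
    # The path of child indices identifies each node uniquely, so two
    # different nodes carrying the same label are never confused.
    ancestors = ancestors + ((path, label),)
    for k, child in enumerate(children):
        if isinstance(child, str):
            yield child, ancestors
        else:
            yield from leaves_with_ancestors(child, path + (k,), ancestors)


def encode_constituents(tree):
    """Return one (n_i, c_i) label per word, with n_i relative to n_{i-1}."""
    leaves = list(leaves_with_ancestors(tree))
    labels, previous = [], 0
    for (_, left), (_, right) in zip(leaves, leaves[1:]):
        # Count the ancestors shared by w_i and w_{i+1} (the root always is).
        shared = 0
        for a, b in zip(left, right):
            if a != b:
                break
            shared += 1
        # c_i is the label of the lowest shared ancestor.
        labels.append((shared - previous, left[shared - 1][1]))
        previous = shared
    labels.append(None)  # assumption: placeholder label for the last word
    return labels


# Toy tree (hypothetical example): (S (NP The dog) (VP barks loudly))
tree = ("S", [("NP", ["The", "dog"]), ("VP", ["barks", "loudly"])])
labels = encode_constituents(tree)
print(labels)  # [(2, 'NP'), (-1, 'S'), (1, 'VP'), None]
```

Here "The" and "dog" share two ancestors (S and NP), "dog" and "barks" share only S (one fewer, hence -1), and "barks" and "loudly" share S and VP (one more, hence +1).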

Dependency parsing as tagging

Strzyz et al. (2019) also propose a linearization method $\Pi_{|w|}:T_{d,|w|}\rightarrow L_{d}^{|w|}$ to transform a dependency tree into a discrete sequence of labels. Each label $r_{i}\in L_{d}$ is also represented as a 3-tuple $(o_{i},p_{i},d_{i})$. If $o_{i}>0$, $w_{i}$’s head is the $o_{i}$th closest word with PoS tag $p_{i}$ to the right of $w_{i}$. If $o_{i}<0$, the head is the $-o_{i}$th closest word with PoS tag $p_{i}$ to the left of $w_{i}$. The element $d_{i}$ represents the syntactic relation between the head and the dependent terms. Figure 1(b) illustrates it with an example.
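The PoS-based encoding above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code. It assumes that heads are given as 1-based word positions with 0 for the root, and that the root is modeled as a dummy token at position 0 with the PoS tag `ROOT`; both conventions, the function names, and the toy sentence are assumptions made for the example.

```python
def encode_dependencies(pos, heads, rels):
    """Map each word to a label (o_i, p_i, d_i)."""
    tags = ["ROOT"] + list(pos)  # assumption: dummy root token at position 0
    labels = []
    for i, (head, rel) in enumerate(zip(heads, rels), start=1):
        p = tags[head]
        if head > i:
            # Head on the right: count words tagged p from i+1 up to the head.
            o = tags[i + 1:head + 1].count(p)
        else:
            # Head on the left: count words tagged p from the head up to i-1.
            o = -tags[head:i].count(p)
        labels.append((o, p, rel))
    return labels


def decode_heads(pos, labels):
    """Recover head positions from (o_i, p_i, d_i) labels."""
    tags = ["ROOT"] + list(pos)
    heads = []
    for i, (o, p, _) in enumerate(labels, start=1):
        step = 1 if o > 0 else -1
        j, remaining = i, abs(o)
        while remaining:
            j += step
            if not 0 <= j < len(tags):
                raise ValueError(f"label of word {i} points outside the sentence")
            if tags[j] == p:
                remaining -= 1
        heads.append(j)
    return heads


# Toy sentence (hypothetical example): "The dog barks"
pos = ["DT", "NN", "VBZ"]
heads = [2, 3, 0]
rels = ["det", "nsubj", "root"]
labels = encode_dependencies(pos, heads, rels)
print(labels)  # [(1, 'NN', 'det'), (1, 'VBZ', 'nsubj'), (-1, 'ROOT', 'root')]
print(decode_heads(pos, labels))  # [2, 3, 0]
```

Decoding only needs the predicted labels and the PoS tags of the sentence, which is why this encoding depends on the quality and granularity of the tags.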

Tagging with LSTMs

We use bidirectional LSTMs (BiLSTMs) to train our models (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997). Briefly, let $\textsc{lstm}_{\rightarrow}(\mathbf{x})$ be an abstraction of an LSTM that processes the input from left to right, and let $\textsc{lstm}_{\leftarrow}(\mathbf{x})$ be another LSTM processing the input in the opposite direction. The output $\mathbf{h}_{i}$ of a BiLSTM at a timestep $i$ is computed as: $\textsc{bilstm}(\mathbf{x},i)=\textsc{lstm}_{\rightarrow}(\mathbf{x}_{0:i})\circ\textsc{lstm}_{\leftarrow}(\mathbf{x}_{i:|w|})$. Then, $\mathbf{h}_{i}$ is further processed by a feed-forward layer to compute the output label, i.e. $P(y|\mathbf{h}_{i})=\mathit{softmax}(\mathbf{W}\cdot\mathbf{h}_{i}+\mathbf{b})$. To optimize the model, we minimize the categorical cross-entropy loss, i.e. $\mathcal{L}=-\sum{\log(P(y|\mathbf{h}_{i}))}$. In Appendix A we detail additional hyperparameters of the network. In this work we use NCRFpp (Yang and Zhang, 2018) as our sequence labeling framework.
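To make the output layer and the loss concrete, here is a small pure-Python sketch of the softmax layer $P(y|\mathbf{h}_{i})$ and the summed cross-entropy. It is only a worked illustration under stated assumptions: the hidden states are given as plain lists (the BiLSTM itself is not implemented), and the function names and toy weights are hypothetical. The actual models rely on NCRFpp.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_distribution(W, b, h):
    """P(y | h) = softmax(W . h + b), with W given as a list of rows."""
    scores = [sum(w * x for w, x in zip(row, h)) + bias
              for row, bias in zip(W, b)]
    return softmax(scores)


def cross_entropy(W, b, hidden_states, gold):
    """Sum over tokens of -log P(gold label | hidden state)."""
    return -sum(math.log(label_distribution(W, b, h)[y])
                for h, y in zip(hidden_states, gold))


# Toy setup (hypothetical): 3 labels, 2-dimensional hidden states, zero weights,
# so every label has probability 1/3 and each token contributes log(3).
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
hidden_states = [[0.5, -1.0], [2.0, 0.3]]
gold = [0, 2]
loss = cross_entropy(W, b, hidden_states, gold)
```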

3 Learning across representations

To learn across representations we cast the problem as multi-task learning. MTL enables learning many tasks jointly, encapsulating them in a single model and leveraging their shared representation (Caruana, 1997; Ruder, 2017). In particular, we use a hard-sharing architecture: the sentence is first processed by stacked BiLSTMs shared across all tasks, with a task-dependent feed-forward network on top to compute each task’s outputs. To benefit from a specific parsing abstraction we use the concept of auxiliary tasks (Plank et al., 2016; Bingel and Søgaard, 2017; Coavoux and Crabbé, 2017), where tasks are learned together with the main task in the MTL setup even if they are not of actual interest by themselves, as they might help to uncover hidden patterns in the data and lead to better generalization of the model. (Footnote 3: Auxiliary losses are usually given less importance during the training process.) For instance, Hershcovich et al. (2018) have shown that semantic parsing benefits from that approach.

The input is the same for both types of parsing and the same number of timesteps is required to compute a tree (equal to the length of the sentence), which simplifies the joint modeling. In this work, we focus on parallel data (we train on the same sentences labeled for both constituency and dependency abstractions). In the future, we plan to explore the idea of exploiting joint training over disjoint treebanks (Barrett et al., 2018).

3.1 Baselines and models

We test different sequence labeling parsers to determine whether there are any benefits in learning across representations. We compare: (i) a single-task model for constituency parsing and another one for dependency parsing, (ii) a multi-task model for constituency parsing (and another for dependency parsing) where each element of the 3-tuple is predicted as a partial label in a separate subtask instead of as a whole, (iii) different MTL models where the partial labels from a specific parsing abstraction are used as auxiliary tasks for the other one, and (iv) an MTL model that learns to produce both abstractions as main tasks.

Single-paradigm, single-task models (s-s)

For constituency parsing, we use the single-task model by Gómez-Rodríguez and Vilares (2018). The input is the raw sentence and the output for each token is a single label of the form $l_{i}=(n_{i},c_{i},u_{i})$. For dependency parsing we use the model by Strzyz et al. (2019) to predict a single dependency label of the form $r_{i}=(o_{i},p_{i},d_{i})$ for each token.

Single-paradigm, multi-task models (s-mtl)

For constituency parsing, instead of predicting a single label output of the form $(n_{i},c_{i},u_{i})$, we generate three partial and separate labels $n_{i}$, $c_{i}$ and $u_{i}$ through three task-dependent feed-forward networks on top of the stacked BiLSTMs. This is similar to Vilares et al. (2019). For dependency parsing, we also propose an MTL version in this work. We observed in preliminary experiments, as shown in Table 1, that casting the problem as 3-task learning led to worse results. Instead, we cast it as a 2-task learning problem, where the first task consists of predicting the head of a word $w_{i}$, i.e. predicting the tuple $(o_{i},p_{i})$, and the second task predicts the type of the relation $(d_{i})$. The loss is here computed as $\mathcal{L}=\sum_{t}\mathcal{L}_{t}$, where $\mathcal{L}_{t}$ is the partial loss coming from the subtask $t$.

Double-paradigm, multi-task models with auxiliary losses (d-mtl-aux)

We predict the partial labels from one of the parsing abstractions as main tasks. The partial labels from the other parsing paradigm are used as auxiliary tasks. The loss is computed as $\mathcal{L}=\sum_{t}\mathcal{L}_{t}+\sum_{a}\beta_{a}\mathcal{L}_{a}$, where $\mathcal{L}_{a}$ is an auxiliary loss and $\beta_{a}$ its specific weighting factor. Figure 2 shows the architecture used in this and the following multi-paradigm model.

Double-paradigm, multi-task models (d-mtl)

All tasks are learned as main tasks instead.
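The three multi-task losses described in this section differ only in which subtasks are weighted. The sketch below shows the arithmetic of $\mathcal{L}=\sum_{t}\mathcal{L}_{t}+\sum_{a}\beta_{a}\mathcal{L}_{a}$ on plain numbers. The task names, loss values and $\beta_{a}$ weights are made up for illustration and are not the values used in the experiments.

```python
def combined_loss(main_losses, aux_losses=None, betas=None):
    """Sum of main-task losses plus beta-weighted auxiliary losses."""
    aux_losses = aux_losses or {}
    betas = betas or {}
    total = sum(main_losses.values())
    # Each auxiliary loss is scaled by its own weighting factor beta_a.
    total += sum(betas[name] * value for name, value in aux_losses.items())
    return total


# Hypothetical per-batch losses: constituency subtasks (n, c, u) as main tasks,
# dependency subtasks (head = (o, p), relation = d) as auxiliary tasks.
constituency = {"n": 0.5, "c": 0.25, "u": 0.25}
dependency = {"head": 1.0, "rel": 2.0}
betas = {"head": 0.5, "rel": 0.25}  # illustrative weights only

s_mtl = combined_loss(constituency)                         # single paradigm
d_mtl_aux = combined_loss(constituency, dependency, betas)  # auxiliary losses
d_mtl = combined_loss({**constituency, **dependency})       # all main tasks
print(s_mtl, d_mtl_aux, d_mtl)  # 1.0 2.0 4.0
```

Setting every $\beta_{a}$ to 1 recovers the d-mtl loss, and dropping the auxiliary terms recovers the s-mtl loss.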

4 Experiments

4.1 Data

In the following experiments we use two parallel datasets that provide syntactic analyses for both dependency and constituency parsing.

PTB

For evaluation on English we use the English Penn Treebank (Marcus et al., 1993), transformed into Stanford dependencies (de Marneffe et al., 2006) with the predicted PoS tags as in Dyer et al. (2016).

SPMRL

We also use the SPMRL datasets, a collection of parallel dependency and constituency treebanks for morphologically rich languages (Seddah et al., 2014). In this case, we use the predicted PoS tags provided by the organizers. We observed some differences between the constituency and dependency predicted input features provided with the corpora. For experiments where dependency parsing is the main task, we use the input from the dependency file, and the converse for constituency, for comparability with other work. d-mtl models were trained twice (once for each input), and dependency and constituency scores are reported for the model trained on the corresponding input.

Metrics

We use bracketing F-score from the original evalb (together with COLLINS.prm) and eval_spmrl (together with spmrl.prm) official scripts to evaluate constituency trees. For dependency parsing, we rely on LAS and UAS scores where punctuation is excluded in order to provide a homogeneous setup for PTB and SPMRL.
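For readers unfamiliar with the dependency metrics, the following sketch computes UAS (correct head) and LAS (correct head and relation) while skipping punctuation. It is not the evaluation script used in the paper: how punctuation tokens are identified is left to the caller through a boolean mask, and the function name and toy data are hypothetical.

```python
def attachment_scores(gold, pred, is_punct):
    """Return (UAS, LAS) in percent over non-punctuation tokens.

    gold and pred are lists of (head, relation) pairs, one per token;
    is_punct marks the tokens to exclude from scoring.
    """
    total = correct_heads = correct_labeled = 0
    for (g_head, g_rel), (p_head, p_rel), punct in zip(gold, pred, is_punct):
        if punct:
            continue
        total += 1
        if g_head == p_head:
            correct_heads += 1
            if g_rel == p_rel:
                correct_labeled += 1
    if total == 0:
        raise ValueError("no scorable tokens")
    return 100 * correct_heads / total, 100 * correct_labeled / total


# Toy sentence (hypothetical): "The dog barks ." with one wrong relation label.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "punct")]
pred = [(2, "det"), (3, "dobj"), (0, "root"), (2, "punct")]
is_punct = [False, False, False, True]
uas, las = attachment_scores(gold, pred, is_punct)
```

In this toy case all three scored heads are right (UAS of 100) but one relation is wrong (LAS of two thirds), and the misattached punctuation token does not affect either score.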

4.2 Results

Table 2 compares single-paradigm models against their double-paradigm MTL versions. On average, MTL models with auxiliary losses achieve the best performance for both parsing abstractions. They gain $1.14$ F1 points on average in comparison with the single model for constituency parsing, and $0.62$ UAS and $0.15$ LAS points for dependency parsing. In comparison to the single-paradigm MTL models, the average gain is smaller: 0.05 F1 points for constituency parsing, and 0.09 UAS and 0.21 LAS points for dependency parsing.

MTL models that use auxiliary tasks (d-mtl-aux) consistently outperform the single-task models (s-s) in all datasets, both for constituency parsing and for dependency parsing in terms of UAS. However, this does not extend to LAS. This different behavior between UAS and LAS seems to stem from the fact that 2-task dependency parsing models, which are the basis for the corresponding auxiliary task and MTL models, improve UAS but not LAS with respect to single-task dependency parsing models. The reason might be that the single-task setup excludes unlikely combinations of dependency labels with PoS tags or dependency directions that are not found in the training set, while in the 2-task setup both components are treated separately, which may have a negative influence on dependency labeling accuracy.

In general, the gains of the models vary across languages. In terms of UAS, the differences between single-task and MTL models range between $1.22$ (Basque) and $-0.14$ (Hebrew); for LAS, $0.81$ and $-1.35$ (both for Hebrew); and for F1, $3.16$ (Hebrew) and $-0.31$ (English). Since the sequence labeling encoding used for dependency parsing heavily relies on PoS tags, the result for a given language can depend on the granularity of its PoS tags.

In addition, Table 3 provides a comparison of the d-mtl-aux models for dependency and constituency parsing against existing models on the PTB test set. Tables 4 and 5 show the results for various existing models on the SPMRL test sets. (Footnote 4: Note that we provide these SPMRL results for merely informative purposes. While they are the best existing results to our knowledge in these datasets, not all are directly comparable to ours (due to not all of them using the same kinds of information, e.g. some models do not use morphological features). Also, there are not many recent results for dependency parsing on the SPMRL datasets, probably due to the popularity of UD corpora.) For comparison, we have included punctuation for the dependency parsing evaluation.

Table 6 shows the speeds (sentences/second) on a single core of a CPU (Intel Core i7-7700, 4.2 GHz). The d-mtl setup comes at almost no added computational cost, so the very good speed-accuracy tradeoff already provided by the single-task models is improved.

5 Conclusion

We have described a framework to leverage the complementary nature of constituency and dependency parsing. It combines multi-task learning, auxiliary tasks, and sequence labeling parsing, so that constituency and dependency parsing can benefit each other through learning across their representations. We have shown that MTL models with auxiliary losses outperform single-task models, and MTL models that treat both constituency and dependency parsing as main tasks obtain strong results, coming almost at no cost in terms of speed. Source code will be released upon acceptance.

Acknowledgments

This work has received funding from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150), from the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, and from Xunta de Galicia (ED431B 2017/01).

Appendix A Model parameters

The models were trained for up to 150 iterations and optimized with Stochastic Gradient Descent (SGD) with a batch size of 8. The best model for constituency parsing was chosen based on the highest F1 score achieved on the development set during training, and the best model for dependency parsing based on the highest LAS score. The best double-paradigm, multi-task model was chosen based on the highest harmonic mean of LAS and F1 scores.
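The selection criterion for the double-paradigm model can be illustrated with a short sketch. The harmonic mean of two scores is $2\cdot\mathit{LAS}\cdot F1/(\mathit{LAS}+F1)$, which favors checkpoints that are balanced across both tasks. The development scores below are invented for the example, and the function names are hypothetical.

```python
def harmonic_mean(las, f1):
    """Harmonic mean of two positive scores."""
    return 2 * las * f1 / (las + f1)


def select_best(dev_scores):
    """Return the epoch whose (LAS, F1) pair has the highest harmonic mean."""
    return max(dev_scores, key=lambda epoch: harmonic_mean(*dev_scores[epoch]))


# Hypothetical development scores per epoch: (LAS, F1).
dev_scores = {
    1: (88.0, 90.0),
    2: (92.0, 85.0),
    3: (89.5, 89.5),
}
best_epoch = select_best(dev_scores)
print(best_epoch)  # 3
```

Epoch 2 has the best LAS and epoch 1 the best F1, but epoch 3 is selected because its harmonic mean (89.5) is the highest.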

Table 7 shows model hyperparameters.

Bibliography

Miguel Ballesteros. 2013. Effective morphological feature selection with MaltOptimizer at the SPMRL 2013 shared task. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 63–70, Seattle, Washington, USA. Association for Computational Linguistics.
Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 349–359, Lisbon, Portugal. Association for Computational Linguistics.
Maria Barrett, Joachim Bingel, Nora Hollenstein, Marek Rei, and Anders Søgaard. 2018. Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 302–312, Brussels, Belgium. Association for Computational Linguistics.
Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. CoRR, abs/1702.08303.
Anders Björkelund, Özlem Çetinoğlu, Richárd Farkas, Thomas Mueller, and Wolfgang Seeker. 2013. (Re)ranking meets morphosyntax: State-of-the-art results from the SPMRL 2013 shared task. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 135–145, Seattle, Washington, USA. Association for Computational Linguistics.
Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.
Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar. Association for Computational Linguistics.