Technical notes: Syntax-aware Representation Learning With Pointer Networks

Matteo Grella

arXiv:1903.07161·cs.CL·March 19, 2019

TL;DR

This paper introduces a novel sequence-to-sequence dependency parsing model combining BiLSTM and Pointer Networks with logistic regression, showing promising initial results on the English Penn-treebank dataset.

Contribution

It proposes a new dependency parsing approach using Pointer Networks with logistic regression, emphasizing the development of latent syntactic knowledge.

Findings

01

Achieved 93.14% UAS on Penn-treebank without fine-tuning.

02

Scores 2-3% below the state of the art, while remaining attractive as a baseline.

03

Provides a promising baseline for future improvements.

Abstract

This is a work-in-progress report, which aims to share preliminary results of a novel sequence-to-sequence schema for dependency parsing that relies on a combination of a BiLSTM and two Pointer Networks (Vinyals et al., 2015), in which the final softmax function has been replaced with logistic regression. The two pointer networks co-operate to develop a latent syntactic knowledge, by learning the lexical properties of "selection" and the lexical properties of "selectability", respectively. At the moment and without fine-tuning, the parser implementation gets a UAS of 93.14% on the English Penn-treebank (Marcus et al., 1993) annotated with Stanford Dependencies: 2-3% below the state of the art (SOTA), yet still attractive as a baseline for the approach.

Tables

Table 1: Hyper-parameters used for the baseline.

Hyper-param	Value
Pre-trained word embedding dimension	100
Word embedding dimension	150
IndPtrNets hidden dimension	100
IndPtrNets hidden activation	Tanh
IndPtrNets attention transformation	Affine
IndPtrNets output activation	Sigmoid
BiLSTMs activations	Tanh
BiLSTMs levels	2
$\alpha$ (word dropout)	0.25

Table 2: Evaluation of different parser configurations. The value in parentheses is the UAS gain over the configuration in the row below.

Parser	Method	UAS
p1 (this work)	heads+deps (avg scores)	93.14 (+0.27)
p2	heads+deps (heads scores)	92.87 (+0.11)
p3	heads+deps (deps scores)	92.76 (+0.32)
p4	heads	92.44 (+0.07)
p5	deps	92.37

Full text

Technical notes: Syntax-aware Representation Learning With Pointer Networks

Matteo Grella

Turin, Italy

[email protected]

1 Introduction

Syntactic analysis via dependency parsing is considered a fundamental step in language processing because of its key role in mediating between linguistic expression and meaning.

Modern approaches to dependency parsing (Dyer et al., 2015; Ballesteros et al., 2016; Kiperwasser and Goldberg, 2016a, to name a few) use deep neural models as auxiliary components to the traditional transition-based and graph-based parsing algorithms (Kubler et al., 2009).

Once deep learning models proved successful at capturing the information relevant to syntactic analysis, the number of parsing architectures in which the neural component is predominant increased considerably, relaxing the need for algorithmic constraints.

Among the recent parsing designs that differ from the traditional approaches, there are Dozat and Manning (2017) who achieve state-of-the-art accuracies using a biaffine attention in a simple graph-based dependency parser; Kiperwasser and Ballesteros (2018) and Li et al. (2019) who use a sequence-to-sequence schema which does not rely on any transition sequence by directly predicting the relative position of the head for each word in the sentence; Grella and Cangialosi (2018) who use a bidirectional recurrent autoencoder to reconstruct for each i-word the j-word corresponding to its head in the sentence; Ma et al. (2018) who combine pointer networks to build the dependency tree in a top-down (from root-to-leaf) depth-first fashion; Strzyz et al. (2019) who use a sequence labeling strategy that outputs for each word the “relative PoS-based encoding” to find its head in the sentence.

The more independent a neural parser is of a superstructure, the more reasonable it is to think that the underlying neural model has learnt a “syntactic knowledge” sufficient to perform the dependency parsing task at hand. [Footnote 1: By independent of a superstructure, we mean that there is no transition-based framework and, in the case of graph-based parsing, that the arc scores are obtained before the search procedure.]

As a by-product of an encoder-decoder parsing schema, it is possible to use the internal encoded representation of the parser to boost the performance of other high-level tasks that benefit from syntactic information. For instance, Kiperwasser and Ballesteros (2018) proposed a scheduled multi-task learning framework to train an encoder-decoder machine translation system sharing the encoder with a seq2seq dependency parser, concluding that syntactic auxiliary tasks are helpful not solely for machine translation but potentially for other systems as well.

Following the recent trend, this paper introduces a parsing model that relies on a combination of a BiLSTM and two Pointer Networks (Vinyals et al., 2015) over the linear sequence of tokens, capable of handling unrestricted non-projective sentences.

The model is trained on two complementary syntactic tasks, as detailed in the sections below, with the aim to create a robust syntactic representation at the encoding layer.

At the moment and without fine-tuning, the parser implementation gets a UAS of 93.14% on the English Penn-treebank (Marcus et al., 1993) annotated with Stanford Dependencies: 2-3% below the state of the art (SOTA), yet still attractive as a baseline for the approach.

Extensive parsing evaluations, as well as experiments on the contribution that the dense representation resulting from the encoding layer could give to other high-level tasks, are still in the planning phase.

2 Our Approach

2.1 Linguistic Background

Long-standing theories and formalisms (Tesnière (1959), Sgall et al. (1986), Mel’cuk (1988), Hudson (1990)) share the fundamental assumption that syntactic structure consists of word-to-word dependencies, i.e., lexical nodes linked by binary asymmetrical relations.

More formally, dependencies can be represented as a set of directed arcs of the form $g \xrightarrow{l} d$, where $g$ is the head/governor node, $d$ is the dependent node ($g \neq d$) and $l$ is the label, resulting in a dependency structure called dependency tree. Hence the name Dependency Grammar (DG).

The DG is sometimes called Valency Grammar, a name conceived by the analogy between the chemical valency and the thematic-argumental structure: a description of the elements that can depend on the word under consideration (its necessary complements, named arguments, and its optional complements, called modifiers). [Footnote 2: Because of this analogy, sometimes it is possible to call the words “atoms”.]

The ability of a word to combine with other words by selecting them as its dependents is hereby called the lexical selection property. Conversely, the ability of a word to be a dependent of another word is hereby called the lexical selectability property.

We can therefore say that a well-formed sentence is a set of words combined so that the selectability and selection properties of each word are satisfied.

For more details on dependency trees, dependency grammar and dependency parsing see Nivre (2003) and the references cited therein.

2.2 Neural Building Blocks

Here is a quick overview of the main neural modules used in our approach.

See Goldberg (2017) for an extensive introduction to the neural building blocks used in natural language processing.

2.2.1 BiLSTM

The BiLSTM (Graves, 2008; Irsoy and Cardie, 2014) consists of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) capable of learning bidirectional long-term dependencies between time steps of time series or sequence data.

The BiLSTM is a well-established neural model used to represent the sentence tokens in their surrounding context. [Footnote 3: Kiperwasser and Goldberg (2016a) were the first to demonstrate the effectiveness of using a conceptually simple BiLSTM in the context of dependency parsing.]

2.2.2 Pointer Network

The Pointer Network (Vinyals et al., 2015) is a type of neural network that works with a variable number of inputs and uses the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015) to learn and predict the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in an input sequence.

In other words, a pointer network uses a softmax output whose dimension is dynamic and equals the length of each input sequence, so that the output space is constrained to the observed positions of the input sequence (not the input space); training maximizes the attention probability of the target input position.
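As a rough illustration of the pointing mechanism described above, the following sketch scores every input position with an additive (Bahdanau-style) attention and normalizes the scores with a softmax, so that the size of the output always equals the length of the input sequence. The function names, the toy weights and the vector sizes are hypothetical and chosen only for the example; they are not taken from the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


def pointer_distribution(encoder_states, query, W_enc, W_qry, v):
    """Additive attention: u_j = v . tanh(W_enc e_j + W_qry q), then softmax over j."""
    q_proj = matvec(W_qry, query)
    scores = []
    for e in encoder_states:
        e_proj = matvec(W_enc, e)
        hidden = [math.tanh(a + b) for a, b in zip(e_proj, q_proj)]
        scores.append(dot(v, hidden))
    return softmax(scores)


# Toy parameters (assumed values, for illustration only).
W_enc = [[0.5, -0.2], [0.1, 0.3]]
W_qry = [[0.4, 0.0], [-0.3, 0.2]]
v = [1.0, -1.0]

# Three input positions give a distribution over three positions.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = pointer_distribution(states, [0.5, 0.5], W_enc, W_qry, v)
```

The same function applied to a five-token input returns five probabilities, which is the "dynamic output dimension" property mentioned in the text.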

2.3 The Idea

Like Zhang et al. (2016) and other recent approaches, we formalize the dependency parsing as the task of finding for each word in a sentence its most probable head without tree structure constraints.

The particularity of our approach is that we consider the probability of a word $w_i$ to be the head of another word $w_j$ as the average of the probability of $w_j$ (dependent) to be selected by $w_i$ (governor), and the probability of $w_i$ to select $w_j$ as its dependent.
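One possible way to write this down, using notation that is introduced here only for illustration and does not appear in the paper: let $w_i$ be the candidate governor and $w_j$ the dependent, let $P_h(i \mid j)$ be the output of the heads-pointing network for $w_j$ at position $i$, and let $P_d(j \mid i)$ be the output of the dependents-pointing network for $w_i$ at position $j$. The combined score could then be sketched as:

```latex
P(w_i \text{ is the head of } w_j) \;=\; \frac{1}{2}\Bigl( P_h(i \mid j) + P_d(j \mid i) \Bigr)
```

Section 2.4 states that, at inference time, the average is taken over the attention values before the sigmoid activation, so this formula is best read as the intuition behind the approach rather than as the exact computation.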

In fact, we want the neural model to learn the syntactic properties of “selectability” and “selection” more explicitly than other models that take into account only one of these two syntactic aspects.

To do this, we intuitively cast the main dependency parsing task into two sub-tasks: heads-pointing for the selectability property and dependents-pointing for the selection property. In essence, the task of heads-pointing consists of finding the most probable head of a given word in the sentence; the task of dependents-pointing consists of finding the most probable dependents of a given word in the sentence.

The gist of the idea is that these sub-tasks should develop different views of the same problem and thus increase the robustness of the learnt syntactic knowledge.

To solve the “problems of pointing” we decided to experiment with Pointer Networks (Ptr-Net). The use of these networks in dependency parsing is not new: Section 3 highlights the main differences between the other models and our approach.

In the Ptr-Net, the output of the attention mechanism is a softmax distribution, which makes it possible to point to the heads because, according to dependency grammar, each word has only one governor. [Footnote 4: Even if it would force the use of a virtual root for the top node.] On the other hand, since a word can have more than one dependent, the softmax function cannot be used to point to them, and this limits the use of the Ptr-Net.

To overcome these limits we introduce a variant of the original Ptr-Net in which the final softmax function is replaced with logistic regression, so that independent predictions enable multiple pointing at the same time as well as no pointing at all. We call our variant Ind-Ptr-Net (Independent Pointer Network). [Footnote 5: To date, the author is surprised not to have found any other references that deal with this variant of the Pointer Network.]

As detailed in Section 2.4, the Ind-Ptr-Net is the backbone of our approach.
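The difference between the two output layers can be seen on a toy vector of attention scores. In the sketch below, the softmax always spreads a total mass of 1.0 over the positions, whereas independent sigmoids can activate several positions at once or none at all. The 0.5 decision threshold and the function names are assumptions made for this example; the paper does not specify a threshold.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def independent_pointing(scores, threshold=0.5):
    """Return every position whose independent sigmoid output exceeds the threshold."""
    return [j for j, s in enumerate(scores) if sigmoid(s) > threshold]


scores = [3.0, -4.0, 2.5, -3.0]

# Softmax: exactly one unit of probability mass, so it cannot express
# "two dependents" or "no head".
softmax_probs = softmax(scores)

# Independent sigmoids: positions 0 and 2 are both pointed to.
multiple = independent_pointing(scores)

# All scores strongly negative: nothing is pointed to (e.g. a word with no dependents).
nothing = independent_pointing([-5.0, -4.0, -6.0])
```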

2.4 Training and Inference Processes

Our model is composed of one BiLSTM and two Ind-Ptr-Nets.

The BiLSTM works as an encoder, receiving as input the tokens of a sentence (already transformed into a dense representation) and generating the context vectors that represent them in the sentence context. The context vectors are in turn given as the input sequence to the two Ind-Ptr-Nets.

Subsequently, the decoding process feeds again each context vector into these Ind-Ptr-Nets to perform the heads-pointing task and dependents-pointing task respectively.

Training.

During the training phase:

- the Ind-Ptr-Net for heads-pointing is trained to activate the output corresponding to the position of the head of the word under consideration (i.e., set it to 1.0). If this word is the top, it is trained not to activate any output (i.e., set them all to zero);

- the Ind-Ptr-Net for dependents-pointing is trained to activate the outputs corresponding to the positions of the dependents of the word under consideration. If this word has no dependents, it is trained not to activate any output.

The gradients are propagated from the two Ind-Ptr-Nets all the way back, through the BiLSTM, down to the initial token embeddings (which are trained together with the model).
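The two sets of training targets described above can be derived from a single gold head array. The sketch below is an illustration under stated assumptions: the helper name `build_targets`, the use of `None` to mark the top token, and the 0-based indexing are choices made here, not details given in the paper.

```python
def build_targets(heads):
    """Build the 0/1 target matrices for the two pointing tasks.

    heads[j] is the index of the head of token j, or None for the top token.
    head_targets[j][i] == 1.0 means "token j should point to i as its head".
    dep_targets[i][j] == 1.0 means "token i should point to j as a dependent".
    """
    n = len(heads)
    head_targets = [[0.0] * n for _ in range(n)]
    dep_targets = [[0.0] * n for _ in range(n)]
    for j, h in enumerate(heads):
        if h is None:
            continue  # top token: its row in head_targets stays all zeros
        head_targets[j][h] = 1.0
        dep_targets[h][j] = 1.0
    return head_targets, dep_targets


# "She reads books": "reads" (index 1) is the top; the other two words depend on it.
heads = [1, None, 1]
head_targets, dep_targets = build_targets(heads)
```

In this example the heads-pointing row for "reads" is all zeros (no head), while its dependents-pointing row activates two positions at once, which is the case the softmax output could not represent.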

Inference.

During the inference phase, the outputs of the two Ind-Ptr-Nets are merged by averaging the attention values before the sigmoid activation.

To construct the dependency tree we select, for each token, the head with the highest score. The top token of the sentence is identified before the other heads are assigned: it is the token whose pointers to the heads have the lowest scores among all tokens (ideally, with all the scores equal to zero). [Footnote 6: To construct a labeled dependency tree, it is possible to add a simple feedforward network that classifies the labels, taking as input the context vectors of each dependent-governor pair. However, we prefer to run further experiments before including any labeling results in this report.]
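The inference steps above can be sketched as follows. This is one possible reading of the description, not the author's code: the pre-sigmoid scores of the two networks are averaged, the top token is taken to be the one whose best head score is the lowest, and every other token receives its highest-scoring head. The function name, the matrix layout and the toy scores are assumptions.

```python
def decode_heads(head_scores, dep_scores):
    """Greedy head selection from the two score matrices (pre-sigmoid values).

    head_scores[j][i]: heads-pointing score of token j for candidate head i.
    dep_scores[i][j]:  dependents-pointing score of token i for candidate dependent j.
    Returns a list where heads[j] is the chosen head of token j, or None for the top.
    """
    n = len(head_scores)
    if n == 1:
        return [None]

    # Average the two views of the same arc i -> j.
    combined = [
        [0.5 * (head_scores[j][i] + dep_scores[i][j]) for i in range(n)]
        for j in range(n)
    ]

    def best_head(j):
        candidates = [i for i in range(n) if i != j]
        return max(candidates, key=lambda i: combined[j][i])

    # Top token: the one whose strongest pointer to a head is the weakest overall.
    top = min(range(n), key=lambda j: combined[j][best_head(j)])
    return [None if j == top else best_head(j) for j in range(n)]


# Toy scores for a three-token sentence (self-arcs carry a large negative value).
head_scores = [
    [-9.0, 4.0, -2.0],
    [-3.0, -9.0, -2.5],
    [-1.0, 3.0, -9.0],
]
dep_scores = [
    [-9.0, -3.5, -2.0],
    [5.0, -9.0, 2.0],
    [-3.0, -3.0, -9.0],
]
heads = decode_heads(head_scores, dep_scores)
```

Nothing in this greedy step guarantees a tree, which is why a separate cycle-fixing pass is applied at test time.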

At test time, we ensure that the dependency tree given in output is well-formed by iteratively identifying and fixing cycles with simple heuristics, without any loss in accuracy. [Footnote 7: For each cycle, the fix is done by removing the arc with the lowest score and assigning to its dependent the node that maximizes its latent head similarity without introducing new cycles.]
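A minimal sketch of such a cycle-fixing heuristic is given below. It follows the description in the footnote but makes several assumptions of its own: arc scores stand in for the "latent head similarity", exactly one token has no head (`None`), and the helper names are hypothetical.

```python
def find_cycle(heads):
    """Return the list of tokens forming a cycle, or None if there is no cycle."""
    for start in range(len(heads)):
        path, seen, node = [], set(), start
        while node is not None and node not in seen:
            seen.add(node)
            path.append(node)
            node = heads[node]
        if node is not None:  # we walked back onto a token already on the path
            return path[path.index(node):]
    return None


def creates_cycle(heads, dependent, new_head):
    """True if attaching `dependent` to `new_head` would close a cycle."""
    seen, node = set(), new_head
    while node is not None and node not in seen:
        if node == dependent:
            return True
        seen.add(node)
        node = heads[node]
    return False


def fix_cycles(heads, scores):
    """Break each cycle at its weakest arc and re-attach that dependent.

    scores[j][i] is the score of token i as head of token j.
    Assumes exactly one token has head None, so a valid new head always exists.
    """
    heads = list(heads)
    while True:
        cycle = find_cycle(heads)
        if cycle is None:
            return heads
        weakest = min(cycle, key=lambda j: scores[j][heads[j]])
        candidates = [
            i for i in range(len(heads))
            if i != weakest and i != heads[weakest]
            and not creates_cycle(heads, weakest, i)
        ]
        heads[weakest] = max(candidates, key=lambda i: scores[weakest][i])


# Tokens 1 and 2 point at each other; token 3 is the top.
scores = [
    [0.0, 0.9, 0.1, 0.2],
    [0.3, 0.0, 0.6, 0.5],
    [0.1, 0.9, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
]
fixed = fix_cycles([1, 2, 1, None], scores)
```

Here the arc from token 1 to head 2 is the weakest arc of the cycle, so token 1 is re-attached; token 0 is rejected as a new head because it is a descendant of token 1.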

We empirically observed that, during decoding, most outputs are already trees and there is no need to fix cycles. This seems to confirm once again that the linear sequence of tokens is itself sufficient to recover the underlying dependency structure (Zhang et al., 2016).

3 Related Approaches

The use of Pointer Networks (Vinyals et al., 2015) in dependency parsing has previously been explored by Chorowski et al. (2016) and Jung et al. (2019), who use Ptr-Nets to predict the heads, and by Ma et al. (2018), who use Ptr-Nets to predict the dependents.

The main difference from the first two approaches, which “point to the heads”, lies in how the root is selected: our model does not require a virtual element, because the top word is recognized as an emerging syntactic property from the absence of strong connections with other words considered as heads.

The main difference from the approach that “points to the dependents” is that our model does not require a deterministic decoding process that selects one dependent per time step: all the dependents of a word can be pointed to simultaneously.

4 Experiments and Results

The parser is implemented in Kotlin, using the SimpleDNN deep learning library. [Footnote 8: https://github.com/KotlinNLP/SimpleDNN] The code will be released in the author's GitHub repository soon.

A performance evaluation has been carried out on the Penn Treebank (PTB) (Marcus et al., 1993) converted to Stanford Dependencies (Marneffe et al., 2006), following the standard train/dev/test splits and without considering punctuation markers. This dataset contains a few non-projective trees.

Our baseline is obtained following the unlabeled parsing approach described in Section 2.4.

A good initial token encoding is crucial to obtain high results in neural parsing, especially for richly inflected languages. [Footnote 9: For example, adding subword information with a character-based representation to the word embeddings has been shown to be effective enough to compensate for the lack of POS tag information (Dozat et al., 2017).]

However, rather than top parsing accuracy, in this study we focus more on the ability of the proposed model to learn a latent representation capable of capturing the information needed for the syntactic analysis; so, for our baseline, we opted for a simple encoded representation of the input tokens.

We encode the input tokens by concatenating the vectors obtained from two embedding maps. The first associates the words found in the training set with randomly initialized vectors; the second contains pre-trained word embeddings. [Footnote 10: The pre-trained word embeddings are the same as those used in Dyer et al. (2015) and Kiperwasser and Goldberg (2016b); the random values are generated using the Glorot initialization.] Both maps are fine-tuned during the training phase.

During the training we replace the embedding vector of a word with an “unknown vector” with a probability that is inversely proportional to the frequency of the word in the tree-bank (tuned with an $\alpha$ coefficient).
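The paper does not give the exact formula for this word dropout. The sketch below assumes the form $\alpha / (\#(w) + \alpha)$, where $\#(w)$ is the frequency of the word in the training data, which is the formulation used by Kiperwasser and Goldberg (2016); whether the author's implementation uses exactly this expression is an assumption. The function names and the `<UNK>` symbol are hypothetical.

```python
import random


def unknown_probability(count, alpha=0.25):
    """Assumed dropout probability: alpha / (count + alpha)."""
    return alpha / (count + alpha)


def apply_word_dropout(tokens, counts, alpha=0.25, rng=random, unk="<UNK>"):
    """Replace each token with `unk` with a probability that decreases with its frequency.

    counts maps a word to its frequency in the training data; unseen words
    have frequency 0 and are therefore always replaced.
    """
    return [
        unk if rng.random() < unknown_probability(counts.get(t, 0), alpha) else t
        for t in tokens
    ]


counts = {"the": 5000, "parser": 12, "hapax": 1}
p_rare = unknown_probability(counts["hapax"])      # a word seen once
p_frequent = unknown_probability(counts["the"])    # a very frequent word
sample = apply_word_dropout(["the", "parser", "hapax"], counts, rng=random.Random(0))
```

With $\alpha = 0.25$ (the value in Table 1), a word seen once is replaced in about one case out of five, while a very frequent word is almost never replaced.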

We optimize the parameters with the Adam update method (Kingma and Ba, 2015) with default parameters ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$). We performed only minimal tuning of the hyper-parameters; the values used for our baseline are reported in Table 1.
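For reference, a single Adam update with these default values can be sketched as below. This is the standard update rule from Kingma and Ba, written in plain Python for illustration; it is not the SimpleDNN implementation, and the `eps` value of 1e-8 is the usual default rather than a value stated in this report.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a flat list of parameters.

    m and v are the running first and second moment estimates; t is the
    1-based step counter used for bias correction.
    """
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [
        p - alpha * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# First step from zero-initialized moments: each parameter moves by roughly
# alpha in the direction opposite to the sign of its gradient.
theta, m, v = adam_step([1.0, -2.0], [0.5, -0.5], [0.0, 0.0], [0.0, 0.0], t=1)
```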

We evaluated five different configurations of the parser (p).

p1: the parser is trained to perform both heads-pointing and dependents-pointing. The scores of the pointers to the heads are the average of the scores resulting from the two sub-tasks;

p2: the parser is trained to perform both heads-pointing and dependents-pointing. The scores of the pointers result from the first task (heads-pointing) only;

p3: the parser is trained to perform both heads-pointing and dependents-pointing. The scores of the pointers result from the second task (dependents-pointing) only;

p4: the parser is trained to perform heads-pointing only;

p5: the parser is trained to perform dependents-pointing only.

We trained four instances of the parsers with different random seeds for up to 10 epochs, and for each parser we selected the model from the epoch with the best accuracy on the development set. The averages of the experimental results are reported in Table 2.

This section will be updated with further experiments soon.

Observation of the results:

With these first experimental results (Table 2), we can observe that the two sub-tasks taken individually (p4 and p5) get almost the same performance.

As soon as the sub-tasks are trained together we can appreciate an increase in performance, even when in the inference phase only one of the two tasks (p3 or p2) is considered.

When, in addition to the joint training, the average of the results of the two sub-tasks is calculated, a further increase of correct arcs is obtained (p1).

5 Conclusion and Future Works

The main objective of this study is to verify the hypothesis that an explicit learning process that considers both the lexical properties of “selectability” and “selection” can result in a more “aware” syntactic representation. For this purpose, we are investigating what kind of “knowledge of language” the proposed neural model is capturing, extending the tests to grammaticality judgments and visualizing which information the networks consider more important at a given moment (Karpathy et al., 2015). [Footnote 11: In our experiments we found that the RAN (Lee et al., 2015) is a valid alternative to the LSTM in the bidirectional recurrent network when speed and highly interpretable outputs are important.]

In this paper we have introduced a simple encoder-decoder approach for dependency parsing that handles unrestricted non-projective dependencies naturally.

We introduced a variant of the Pointer Network, named Ind-Ptr-Net (Independent Pointer Network), where the final softmax function is replaced with logistic regression, so that independent predictions enable multiple pointing at the same time as well as no pointing at all. [Footnote 12: We also plan to experiment with the tanh function: as the tanh has a derivative of up to 1.0, we think that “larger updates” of the weights can result in a better and faster learning process.]

With the aim of understanding the potential and the limits of the proposed approach, we intend to test more sophisticated initial token encodings and to evaluate the parsing model on other tree-banks with a higher ratio of non-projective sentences.

Bibliography

1. Ballesteros et al. (2016): Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack-LSTM parser. CoRR, 1603.
2. Chorowski et al. (2016): Jan Chorowski, Michał Zapotoczny, and Paweł Rychlikowski. 2016. Read, tag, and parse all at once, or fully-neural dependency parsing.
3. Dozat and Manning (2017): Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proc. of ICLR.
4. Dozat et al. (2017): Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford’s graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30.
5. Dyer et al. (2015): Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.
6. Goldberg (2017): Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
7. Graves (2008): Alex Graves. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University Munich.
8. Grella and Cangialosi (2018): Matteo Grella and Simone Cangialosi. 2018. Non-projective dependency parsing via latent heads representation (LHR). CoRR, abs/1802.02116.