Attention-based Conditioning Methods for External Knowledge Integration

Katerina Margatina; Christos Baziotis; Alexandros Potamianos

arXiv:1906.03674·cs.LG·June 11, 2019

TL;DR

This paper introduces three novel methods for integrating external lexicon-based knowledge into RNNs via attention mechanisms, improving task performance with minimal computational cost.

Contribution

The paper proposes three new attention-based conditioning techniques for external knowledge integration in RNNs, enhancing their effectiveness across multiple benchmarks.

Findings

01. Attentional gating improves performance consistently.

02. Methods are simple to implement with minimal overhead.

03. Effective across six benchmark datasets.

Abstract

In this paper, we present a novel approach for incorporating external knowledge in Recurrent Neural Networks (RNNs). We propose the integration of lexicon features into the self-attention mechanism of RNN-based architectures. This form of conditioning on the attention distribution enforces the contribution of the most salient words for the task at hand. We introduce three methods, namely attentional concatenation, feature-based gating and affine transformation. Experiments on six benchmark datasets show the effectiveness of our methods. Attentional feature-based gating yields consistent performance improvement across tasks. Our approach is implemented as a simple add-on module for RNN-based models with minimal computational overhead and can be adapted to any deep neural architecture.

Tables

Table 1: The lexicons used as external knowledge.

Lexicons	Annotations	# dim.	# words
LIWC	psycho-linguistic	73	18,504
Bing Liu	valence	1	2,477
AFINN	sentiment	1	6,786
MPQA	sentiment	4	6,886
SemEval15	sentiment	1	1,515
Emolex	emotion	19	14,182

Table 2: Description of benchmark datasets. We split 10% of the train set to serve as the validation set.

Dataset	Study	Task	Domain	Classes	$N_{train}$	$N_{test}$
SST-5	Socher et al. (2013)	Sentiment	Movie Reviews	5	9,645	2,210
Sent17	Rosenthal et al. (2017)	Sentiment	Twitter	3	49,570	12,284
PsychExp	Wallbott and Scherer (1986)	Emotion	Experiences	7	1,000	6,480
Irony18	Van Hee et al. (2018)	Irony	Twitter	4	3,834	784
SCv1	Lukin and Walker (2013)	Sarcasm	Debate Forums	2	1,000	995
SCv2	Oraby et al. (2016)	Sarcasm	Debate Forums	2	1,000	2,260

Equations

a_{i} = softmax(v_{a}^{T} f(h_{i}))

r = Σ_{i=1}^{T} a_{i} h_{i}

f_{c} (h_{i}, c (w_{i})) = tanh(W_{c} [h_{i} ∥ c (w_{i})] + b_{c})

f_{g} (h_{i}, c (w_{i})) = σ(W_{g} c (w_{i}) + b_{g}) ⊙ h_{i}

f_{a} (h_{i}, c (w_{i})) = γ (c (w_{i})) ⊙ h_{i} + β (c (w_{i}))

γ (x) = W_{γ} x + b_{γ}, β (x) = W_{β} x + b_{β}

Code & Models

Repositories

mourga/affective-attention (PyTorch, official)

Full text

Attention-based Conditioning Methods

for External Knowledge Integration

Katerina Margatina¹, Christos Baziotis², Alexandros Potamianos¹,³,⁴

¹ School of ECE, National Technical University of Athens, Athens, Greece

² School of Informatics, University of Edinburgh, UK

³ Signal Analysis and Interpretation Laboratory (SAIL), USC, Los Angeles, USA

⁴ Behavioral Signal Technologies, Los Angeles, USA

The research was conducted when the author was a researcher at School of ECE, NTUA in Athens, Greece.

(2019)

1 Introduction

Modern deep learning algorithms often do away with feature engineering and learn latent representations directly from raw data that are given as input to Deep Neural Networks (DNNs) Mikolov et al. (2013); McCann et al. (2017); Peters et al. (2018). However, it has been shown that linguistic knowledge (manually or semi-automatically encoded into lexicons and knowledge bases) can significantly improve DNN performance for Natural Language Processing (NLP) tasks, such as natural language inference Mrkšić et al. (2017), language modelling Ahn et al. (2016), named entity recognition Ghaddar and Langlais (2018) and relation extraction Vashishth et al. (2018).

For NLP tasks, external sources of information are typically incorporated into deep neural architectures by processing the raw input in the context of such external linguistic knowledge. In machine learning, this contextual processing is known as conditioning; the computation carried out by a model is conditioned or modulated by information extracted from an auxiliary input. The most commonly-used method of conditioning is concatenating a representation of the external information to the input or hidden network layers.

Attention mechanisms Bahdanau et al. (2015); Vaswani et al. (2017); Lin et al. (2017) are a key ingredient for achieving state-of-the-art performance in tasks such as textual entailment Rocktäschel et al. (2016), question answering Xiong et al. (2017), and neural machine translation Wu et al. (2016). Often task-specific attentional architectures are proposed in the literature to further improve DNN performance Dhingra et al. (2017); Xu et al. (2015); Barrett et al. (2018).

In this work, we propose a novel way of utilizing word-level prior information encoded in linguistic, sentiment, and emotion lexicons, to improve classification performance. Usually, lexicon features are concatenated to word-level representations Wang et al. (2016); Yang et al. (2017); Trotzek et al. (2018), as additional features to the embedding of each word or the hidden states of the model. By contrast, we propose to incorporate them into the self-attention mechanism of RNNs. Our goal is to enable the self-attention mechanism to identify the most informative words, by directly conditioning on their additional lexicon features.

Our contributions are the following: (1) we propose an alternative way of incorporating external knowledge into RNN-based architectures, (2) we present empirical results showing that our proposed approach consistently outperforms strong baselines, and (3) we report state-of-the-art performance on two datasets. We make our source code publicly available at https://github.com/mourga/affective-attention.

2 Related Work

In the traditional machine learning literature where statistical models are based on sparse features, affective lexicons have been shown to be highly effective for tasks such as sentiment analysis, as they provide additional information not captured in the raw training data Hu and Liu (2004); Kim and Hovy (2004); Ding et al. (2008); Yu and Dredze (2014); Taboada et al. (2011). After the emergence of pretrained word representations Mikolov et al. (2013); Pennington et al. (2014), the use of lexicons is no longer common practice, since word embeddings can also capture some of the affective meaning of these words.

Recently, there have been notable contributions towards integrating linguistic knowledge into DNNs for various NLP tasks. For sentiment analysis, Teng et al. (2016) integrate lexicon features into an RNN-based model with a custom weighted-sum calculation of word features. Shin et al. (2017) propose three methods of lexicon integration specific to convolutional neural networks, achieving state-of-the-art performance on two datasets. Kumar et al. (2018) concatenate features from a knowledge base to word representations in an attentive bidirectional LSTM architecture, also reporting state-of-the-art results. For sarcasm detection, Yang et al. (2017) incorporate psycholinguistic, stylistic, structural, and readability features by concatenating them to paragraph and document-level representations.

Furthermore, there is limited literature regarding the development and evaluation of methods for combining representations in deep neural networks. Peters et al. (2017) claim that concatenation, non-linear mapping and attention-like mechanisms are unexplored methods for including language model representations in their sequence model. They employ simple concatenation, leaving the exploration of other methods to future work. Dumoulin et al. (2018) provide an overview of feature-wise transformations such as concatenation-based conditioning, conditional biasing and gating mechanisms. They review the effectiveness of conditioning methods in tasks such as visual question answering Strub et al. (2018), style transfer Dumoulin et al. (2017) and language modeling Dauphin et al. (2017). They also extend the work by Perez et al. (2017), which proposes the Feature-wise Linear Modulation (FiLM) framework, and investigate its applications in visual reasoning tasks. Balazs and Matsuo (2019) provide an empirical study showing the effects of different ways of combining character and word representations in word-level and sentence-level evaluation tasks. Among the reported findings is that gating-based conditioning performs consistently better across a variety of word similarity and relatedness tasks.

3 Proposed Model

3.1 Network Architecture

Word Embedding Layer. The input sequence of words $w_{1},w_{2},\ldots,w_{T}$ is projected to a low-dimensional vector space $\mathbb{R}^{W}$, where $W$ is the size of the embedding layer and $T$ the number of words in a sentence. We initialize the weights of the embedding layer with pretrained word embeddings.

LSTM Layer. A Long Short-Term Memory unit (LSTM) Hochreiter and Schmidhuber (1997) takes as input the words of a sentence and produces the word annotations $h_{1},h_{2},\ldots,h_{T}$, where $h_{i}$ is the hidden state of the LSTM at time-step $i$, summarizing all sentence information up to $w_{i}$.

Self-Attention Layer. We use a self-attention mechanism Cheng et al. (2016) to find the relative importance of each word for the task at hand. The attention mechanism assigns a score $a_{i}$ to each word annotation $h_{i}$. We compute the fixed representation $r$ of the input sequence as the weighted sum of all the word annotations. Formally:

$$a_{i} = \mathrm{softmax}(v_{a}^{\top} f(h_{i})) \qquad (1)$$

$$r = \sum_{i=1}^{T} a_{i} h_{i} \qquad (2)$$

where $f(\cdot)$ corresponds to a non-linear transformation $\tanh(W_{a}h_{i}+b_{a})$ and $W_{a},b_{a},v_{a}$ are the parameters of the attention layer.
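The self-attention layer described above can be sketched in a few lines of NumPy. This is only an illustrative forward pass, assuming the scores are a softmax over $v_{a}^{\top}\tanh(W_{a}h_{i}+b_{a})$; the function name `self_attention`, the toy dimensions, the random parameters, and the square shape of $W_{a}$ are assumptions made for the example and are not taken from the authors' PyTorch implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def self_attention(H, W_a, b_a, v_a):
    """H: (T, d) word annotations h_1..h_T.

    Returns the attention weights a (T,) and the sentence
    representation r (d,), the weighted sum of the annotations.
    """
    scores = np.tanh(H @ W_a.T + b_a) @ v_a  # one scalar score per word
    a = softmax(scores)                      # attention distribution
    r = a @ H                                # r = sum_i a_i * h_i
    return a, r

# Toy example with random values (assumed sizes: T=5 words, d=8 hidden units).
rng = np.random.default_rng(0)
T, d = 5, 8
H = rng.normal(size=(T, d))
W_a = 0.1 * rng.normal(size=(d, d))
b_a = np.zeros(d)
v_a = rng.normal(size=d)

a, r = self_attention(H, W_a, b_a, v_a)
```

The weights `a` form a probability distribution over the words of the sentence, so they can be inspected directly to see which words the model attends to.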

3.2 External Knowledge

In this work, we augment our models with existing linguistic and affective knowledge from human experts. Specifically, we leverage lexica containing psycho-linguistic, sentiment and emotion annotations. We construct a feature vector $c(w_{i})$ for every word in the vocabulary by concatenating the word’s annotations from the lexicons shown in Table 1. For words missing from a lexicon, we set the corresponding dimension(s) of $c(w_{i})$ to zero.
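A minimal sketch of how such a feature vector could be assembled is given below. The two toy lexicons, their dimensions and their entries are hypothetical placeholders for the resources in Table 1, and `lexicon_features` is a name chosen for illustration.

```python
import numpy as np

# Hypothetical toy lexicons: name -> (number of dimensions, {word: annotations}).
# They stand in for the real lexicons, which are not reproduced here.
LEXICONS = {
    "valence": (1, {"good": [1.0], "bad": [-1.0]}),
    "emotion": (3, {"good": [0.0, 1.0, 0.0], "fear": [1.0, 0.0, 0.0]}),
}

def lexicon_features(word):
    """Build c(w) by concatenating the word's annotations from every lexicon.

    A word that is absent from a lexicon contributes zeros in that
    lexicon's dimensions.
    """
    parts = []
    for dim, entries in LEXICONS.values():
        parts.append(np.asarray(entries.get(word, [0.0] * dim), dtype=float))
    return np.concatenate(parts)

c_good = lexicon_features("good")    # present in both lexicons
c_fear = lexicon_features("fear")    # present only in the emotion lexicon
c_table = lexicon_features("table")  # present in neither lexicon
```

Every word gets a vector of the same length (here 1 + 3 = 4), which keeps the downstream conditioning layers independent of which lexicons happen to cover a given word.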

3.3 Conditional Attention Mechanism

We extend the standard self-attention mechanism (Eq. 1, 2) in order to condition the attention distribution of a given sentence on each word’s prior lexical information. To this end, we use as input to the self-attention layer both the word annotation $h_{i}$ and the lexicon feature $c(w_{i})$ of each word. Therefore, we replace $f(h_{i})$ in Eq. 1 with $f(h_{i},c(w_{i}))$. Specifically, we explore three conditioning methods, which are illustrated in Figure 1. We refer to the conditioning function as $f_{i}(\cdot)$, the weight matrix as $W_{i}$ and the biases as $b_{i}$, where the subscript $i$ is here an indicative letter for each method ($c$, $g$ or $a$) rather than a word index. We present our results in Section 5 (Table 3) and we denote the three conditioning methods as “conc.”, “gate” and “affine” respectively.

Attentional Concatenation. In this approach, as illustrated in Fig. 1(a), we learn a function of the concatenation of each word annotation $h_{i}$ with its lexicon features $c(w_{i})$. The intuition is that by adding extra dimensions to $h_{i}$, learned representations are more discriminative. Concretely:

$$f_{c}(h_{i}, c(w_{i})) = \tanh(W_{c}[h_{i} \parallel c(w_{i})] + b_{c}) \qquad (3)$$

where $\parallel$ denotes the concatenation operation and $W_{c},b_{c}$ are learnable parameters.
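The following NumPy sketch illustrates attentional concatenation under the assumption that the conditioning function is $\tanh(W_{c}[h_{i} \parallel c(w_{i})] + b_{c})$ and that the result is scored by $v_{a}$ as in the standard attention layer. The function name, shapes and random values are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_concat(H, C, W_c, b_c, v_a):
    """H: (T, d) word annotations, C: (T, k) lexicon features.

    Scores each word from the concatenation [h_i || c(w_i)] and
    returns the attention weights and the sentence representation.
    """
    HC = np.concatenate([H, C], axis=1)  # (T, d + k)
    f = np.tanh(HC @ W_c.T + b_c)        # (T, d)
    a = softmax(f @ v_a)                 # (T,)
    return a, a @ H                      # r is still a weighted sum of h_i

# Toy example (assumed sizes: T=4 words, d=6 hidden units, k=3 lexicon dims).
rng = np.random.default_rng(1)
T, d, k = 4, 6, 3
H = rng.normal(size=(T, d))
C = rng.normal(size=(T, k))
W_c = 0.1 * rng.normal(size=(d, d + k))
b_c = np.zeros(d)
v_a = rng.normal(size=d)

a, r = attention_concat(H, C, W_c, b_c, v_a)
```

Only the scoring function sees the lexicon features; the sentence representation `r` is still built from the LSTM annotations alone.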

Attentional Feature-based Gating. The second approach, illustrated in Fig. 1(b), learns a feature mask, which is applied on each word annotation $h_{i}$. Specifically, a gate mechanism with a sigmoid activation function generates a mask-vector from each $c(w_{i})$ with values between 0 and 1 (black and white dots in Fig. 1(b)). Intuitively, this gating mechanism selects salient dimensions (i.e. features) of $h_{i}$, conditioned on the lexical information. Formally:

$$f_{g}(h_{i}, c(w_{i})) = \sigma(W_{g} c(w_{i}) + b_{g}) \odot h_{i} \qquad (4)$$

where $\odot$ denotes element-wise multiplication and $W_{g},b_{g}$ are learnable parameters.
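A hedged NumPy sketch of feature-based gating follows. It assumes the mask is $\sigma(W_{g}c(w_{i})+b_{g})$, applied element-wise to $h_{i}$, and that the gated annotation is scored directly by $v_{a}$. Names, shapes and random values are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(H, C, W_g, b_g, v_a):
    """H: (T, d) word annotations, C: (T, k) lexicon features.

    Returns the mask G (T, d), the attention weights a (T,) and the
    sentence representation r (d,).
    """
    G = sigmoid(C @ W_g.T + b_g)  # one mask vector per word, values in (0, 1)
    f = G * H                     # element-wise gating of each annotation
    a = softmax(f @ v_a)
    return G, a, a @ H

# Toy example (assumed sizes: T=4 words, d=6 hidden units, k=3 lexicon dims).
rng = np.random.default_rng(2)
T, d, k = 4, 6, 3
H = rng.normal(size=(T, d))
C = rng.normal(size=(T, k))
W_g = rng.normal(size=(d, k))
b_g = np.zeros(d)
v_a = rng.normal(size=d)

G, a, r = attention_gate(H, C, W_g, b_g, v_a)
```

Because the mask depends only on the lexicon features, two occurrences of the same word always receive the same mask, while the annotation it is applied to still depends on context.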

Attentional Affine Transformation. The third approach, shown in Fig. 1(c), is adopted from the work of Perez et al. (2017) and applies a feature-wise affine transformation to the latent space of the hidden states. Specifically, we use the lexicon features $c(w_{i})$ in order to conditionally generate the corresponding scaling $\gamma(\cdot)$ and shifting $\beta(\cdot)$ vectors. Concretely:

$$f_{a}(h_{i}, c(w_{i})) = \gamma(c(w_{i})) \odot h_{i} + \beta(c(w_{i})) \qquad (5)$$

$$\gamma(x) = W_{\gamma} x + b_{\gamma}, \quad \beta(x) = W_{\beta} x + b_{\beta} \qquad (6)$$

where $W_{\gamma},W_{\beta},b_{\gamma},b_{\beta}$ are learnable parameters.
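The affine conditioning can be sketched as follows, using $f_{a}(h_{i}, c(w_{i})) = \gamma(c(w_{i})) \odot h_{i} + \beta(c(w_{i}))$ with linear $\gamma$ and $\beta$. Scoring the result with $v_{a}$, the function names, the shapes and the random values are assumptions made for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def affine_condition(H, C, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise affine transformation of the annotations H (T, d),
    with scale and shift generated from the lexicon features C (T, k)."""
    gamma = C @ W_gamma.T + b_gamma  # (T, d) scaling vectors
    beta = C @ W_beta.T + b_beta     # (T, d) shifting vectors
    return gamma * H + beta

def attention_affine(H, C, W_gamma, b_gamma, W_beta, b_beta, v_a):
    f = affine_condition(H, C, W_gamma, b_gamma, W_beta, b_beta)
    a = softmax(f @ v_a)
    return a, a @ H

# Toy example (assumed sizes: T=4 words, d=6 hidden units, k=3 lexicon dims).
rng = np.random.default_rng(3)
T, d, k = 4, 6, 3
H = rng.normal(size=(T, d))
C = rng.normal(size=(T, k))
W_gamma, W_beta = rng.normal(size=(d, k)), rng.normal(size=(d, k))
b_gamma, b_beta = np.ones(d), np.zeros(d)
v_a = rng.normal(size=d)

a, r = attention_affine(H, C, W_gamma, b_gamma, W_beta, b_beta, v_a)
```

With zero weight matrices, $b_{\gamma}=1$ and $b_{\beta}=0$, the transformation reduces to the identity, which is a convenient sanity check for an implementation.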

3.4 Baselines

We employ two baselines: The first baseline is an LSTM-based architecture augmented with a self-attention mechanism (Sec. 3.1) with no external knowledge. The second baseline incorporates lexicon information by concatenating the $c(w_{i})$ vectors to the word representations in the embedding layer. In Table 3 we use the abbreviations “baseline” and “emb. conc.” for the two baseline models respectively.
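For contrast with the attention-level methods, the second baseline can be pictured as a plain concatenation at the input of the LSTM. The sketch below is an assumption about the shapes involved, and `concat_lexicon_to_embeddings` is a hypothetical helper name.

```python
import numpy as np

def concat_lexicon_to_embeddings(E, C):
    """E: (T, W) word embeddings, C: (T, k) lexicon features.

    Returns the (T, W + k) inputs that would be fed to the LSTM in the
    "emb. conc." baseline.
    """
    if E.shape[0] != C.shape[0]:
        raise ValueError("E and C must describe the same number of words")
    return np.concatenate([E, C], axis=1)

# Toy example (assumed sizes: T=3 words, W=5 embedding dims, k=2 lexicon dims).
E = np.zeros((3, 5))
C = np.ones((3, 2))
X = concat_lexicon_to_embeddings(E, C)
```

Here the lexicon features pass through the recurrent layer together with the embeddings, whereas the proposed methods inject them only where the attention scores are computed.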

4 Experiments

Lexicon Features. As prior knowledge, we leverage the lexicons presented in Table 1. We selected widely-used lexicons that represent different facets of affective and psycho-linguistic features, namely LIWC Tausczik and Pennebaker (2010), Bing Liu Opinion Lexicon Hu and Liu (2004), AFINN Nielsen (2011), Subjectivity Lexicon Wilson et al. (2005), SemEval 2015 English Twitter Lexicon Svetlana Kiritchenko and Mohammad (2014), and NRC Emotion Lexicon (EmoLex) Mohammad and Turney (2013).

Datasets. The proposed framework can be applied to different domains and tasks. In this paper, we experiment with sentiment analysis, emotion recognition, irony, and sarcasm detection. Details of the benchmark datasets are shown in Table 2.

Preprocessing. To preprocess the words, we use the tool Ekphrasis (Baziotis et al., 2017). After tokenization, we map each word to the corresponding pretrained word representation: Twitter-specific word2vec embeddings Chronopoulou et al. (2018) for the Twitter datasets, and fasttext Bojanowski et al. (2017) for the rest.

Experimental Setup. For all methods, we employ a single-layer LSTM model with 300 neurons augmented with a self-attention mechanism, as described in Section 3. As regularization techniques, we apply early stopping, Gaussian noise $\mathcal{N}(0,0.1)$ to the word embedding layer, and dropout to the LSTM layer with $p=0.2$. We use Adam to optimize our networks Kingma and Ba (2014) with mini-batches of size 64 and clip the norm of the gradients Pascanu et al. (2013) at 0.5, as an extra safety measure against exploding gradients. We also use PyTorch Paszke et al. (2017) and scikit-learn Pedregosa et al. (2011).
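Two of the regularization steps above, embedding noise and gradient-norm clipping, can be illustrated with a framework-free NumPy sketch. It treats the 0.1 in $N(0,0.1)$ as a standard deviation, which is an assumption, and the helper names are hypothetical. The authors' PyTorch code is not reproduced here, although PyTorch offers a comparable clipping utility in `torch.nn.utils.clip_grad_norm_`.

```python
import numpy as np

def add_gaussian_noise(E, std=0.1, rng=None, training=True):
    """Add zero-mean Gaussian noise to embeddings E during training only."""
    if not training:
        return E
    rng = np.random.default_rng() if rng is None else rng
    return E + rng.normal(0.0, std, size=E.shape)

def clip_grad_norm(grads, max_norm=0.5):
    """Rescale a list of gradient arrays so that their joint L2 norm is
    at most max_norm. Returns the clipped gradients and the original norm."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total <= max_norm:
        return [g.copy() for g in grads], total
    scale = max_norm / total
    return [g * scale for g in grads], total

# Toy example: two gradient arrays with a joint norm well above 0.5.
grads = [np.full(4, 3.0), np.full(2, 4.0)]
clipped, original_norm = clip_grad_norm(grads, max_norm=0.5)
clipped_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

E = np.zeros((3, 5))
E_noisy = add_gaussian_noise(E, std=0.1, rng=np.random.default_rng(0))
E_eval = add_gaussian_noise(E, training=False)
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its magnitude is bounded.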

5 Results & Analysis

We compare the performance of the three proposed conditioning methods with the two baselines and the state-of-the-art in Table 3. We also provide results for the combination of our best method, attentional feature-based gating, and the second baseline model (emb. conc.).

The results show that incorporating external knowledge in RNN-based architectures consistently improves performance over the baseline for all datasets. Furthermore, feature-based gating improves upon baseline concatenation in the embedding layer across benchmarks, with the exception of the PsychExp dataset.

For the Sent17 dataset we achieve state-of-the-art $F_{1}$ score using the feature-based gating method; we further improve performance when combining gating with the emb. conc. method. For SST-5, we observe a significant performance boost with combined attentional gating and embedding conditioning (gate + emb. conc.). For PsychExp, we marginally outperform the state-of-the-art, also with the combined method, while for Irony18, feature-based gating yields the best results. Finally, concatenation-based conditioning is the top method for SCv1, and the combination method for SCv2.

Overall, attentional feature-based gating is the best performing conditioning method, followed by concatenation. Attentional affine transformation underperforms, especially on smaller datasets; this is probably due to the higher capacity of this model. This is particularly interesting since gating (Eq. 4) is a special case of the affine transformation method (Eq. 5), where the shifting vector $\beta$ is zero and the scaling vector $\gamma$ is bounded to the range $[0,1]$ (Eq. 6). Interestingly, the combination of gating with traditional embedding-layer concatenation gives additional performance gains for most tasks, indicating that there are synergies to exploit in various conditioning methods.
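The special-case relation mentioned above can be written out explicitly. The derivation below is a sketch that assumes the gate has the form $\sigma(W_{g}c(w_{i})+b_{g}) \odot h_{i}$; the scaling vector is then a sigmoid of a linear map, so the correspondence is with the general form of Eq. 5 under a bounded $\gamma$.

```latex
\begin{align*}
f_{a}(h_{i}, c(w_{i})) &= \gamma(c(w_{i})) \odot h_{i} + \beta(c(w_{i})) \\
\text{set}\quad \beta(c(w_{i})) &= 0, \qquad
\gamma(c(w_{i})) = \sigma(W_{g}\, c(w_{i}) + b_{g}) \in (0,1)^{d} \\
\Rightarrow\quad f_{a}(h_{i}, c(w_{i})) &= \sigma(W_{g}\, c(w_{i}) + b_{g}) \odot h_{i}
 = f_{g}(h_{i}, c(w_{i}))
\end{align*}
```

Read this way, the affine method has strictly more freedom (an unbounded scale and a non-zero shift), which is consistent with the capacity explanation given for its weaker results on small datasets.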

In addition to the performance improvements, we can visually evaluate the effect of conditioning the attention distribution on prior knowledge and improve the interpretability of our approach. As we can see in Figure 2, lexicon features guide the model to attend to more salient words and thus predict the correct class.

6 Conclusions & Future work

We introduce three novel attention-based conditioning methods and compare their effectiveness with traditional concatenation-based conditioning. Our methods are simple, yet effective, achieving consistent performance improvement for all datasets. Our approach can be applied to any RNN-based architecture as an extra module to further improve performance with minimal computational overhead.

In the future, we aim to incorporate more elaborate linguistic resources (e.g. knowledge bases) and to investigate the performance of our methods on more complex NLP tasks, such as named entity recognition and sequence labelling, where prior knowledge integration is an active area of research.

Acknowledgements

We would like to thank our colleagues Alexandra Chronopoulou and Georgios Paraskevopoulos for their helpful suggestions and comments. This work has been partially supported by computational time granted from the Greek Research & Technology Network (GR-NET) in the National HPC facility - ARIS. We thank NVIDIA for supporting this work by donating a TitanX GPU.

Bibliography

1. Ahn et al. (2016): Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. arXiv preprint arXiv:1608.00318.
2. Bahdanau et al. (2015): Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.
3. Balazs and Matsuo (2019): Jorge Balazs and Yutaka Matsuo. 2019. Gating mechanisms for combining character and word-level word representations: an empirical study. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 110–124, Minneapolis, Minnesota. Association for Computational Linguistics.
4. Barrett et al. (2018): Maria Barrett, Joachim Bingel, Nora Hollenstein, Marek Rei, and Anders Søgaard. 2018. Sequence classification with human attention. In Proceedings of the Conference on Computational Natural Language Learning, pages 302–312.
5. Baziotis et al. (2018): Christos Baziotis, Athanasiou Nikolaos, Pinelopi Papalampidi, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, and Alexandros Potamianos. 2018. NTUA-SLP at SemEval-2018 Task 3: Tracking ironic tweets using ensembles of word and character level attentive RNNs. In Proceedings of the International Workshop on Semantic Evaluation, pages 613–621.
6. Baziotis et al. (2017): Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the International Workshop on Semantic Evaluation, pages 747–754.
7. Bojanowski et al. (2017): Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
8. Cheng et al. (2016): Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 551–561.