Latent Variable Sentiment Grammar

Liwen Zhang; Kewei Tu; Yue Zhang

arXiv:1907.00218·cs.CL·July 9, 2019

TL;DR

This paper introduces a sentiment grammar framework that explicitly models sentiment composition using latent variables and Gaussian mixtures, improving neural sentiment classifiers on the SST benchmark.

Contribution

It proposes two formal models for deep sentiment representation that explicitly encode sentiment subtypes, enhancing neural sentiment analysis.

Findings

01

Sentiment grammar outperforms vanilla neural encoders.

02

Using ELMo embeddings yields state-of-the-art results.

03

Models effectively capture sentiment subtype expressions.

Abstract

Neural models have been investigated for sentiment classification over constituent trees. They learn phrase composition automatically by encoding tree structures but do not explicitly model sentiment composition, which requires encoding sentiment class labels. To this end, we investigate two formalisms with deep sentiment representations that capture sentiment subtype expressions by latent variables and Gaussian mixture vectors, respectively. Experiments on the Stanford Sentiment Treebank (SST) show the effectiveness of sentiment grammar over vanilla neural encoders. Using ELMo embeddings, our method gives the best results on this benchmark.

Tables

Table 1: Equations 10-12 calculate the inside score, outside score and production rule score for LVG, respectively; Equations 13-15 are the corresponding functions for LVeG. Equation 10 and Equation 13 are the inside score functions of a sentiment polarity $A$ over its span $\bm{w}_{i:j}$ in the sentence $\bm{w}_{1:n}$. Equation 11 and Equation 14 are the outside score functions of a sentiment polarity $A$ over a span $\bm{w}_{i:j}$ in the sentence $\bm{w}_{1:n}$. Equation 12 and Equation 15 are the production rule score functions of a rule $A\shortrightarrow BC$ with sentiment polarities $A$, $B$, and $C$ spanning words $\bm{w}_{i:j}$, $\bm{w}_{i:k}$, and $\bm{w}_{k+1:j}$, respectively. Here lower case letters $a$, $b$, $c\dots$ represent discrete subtypes of sentiment polarities $A$, $B$, $C\dots$ in LVG, and bold lower case letters $\bm{a}$, $\bm{b}$, $\bm{c}\dots$ represent continuous subtypes of sentiment polarities in LVeG. Note that spans such as $\bm{w}_{i:j}$ mentioned above are all given by the skeleton $K$ of sentence $\bm{w}_{1:n}$.

$s_{I}^{A}(a,i,j)=\sum_{A\to BC\in R_{t}}\sum_{b\in N}\sum_{c\in N}W_{A\to\bm{w}_{i:j}}(a)\,W_{A\to BC}(a,b,c)\times s_{I}^{B}(b,i,k)\,s_{I}^{C}(c,k+1,j).$ (10)
$s_{O}^{A}(a,i,j)=\sum_{B\to CA\in R_{t}}\sum_{b\in N}\sum_{c\in N}W_{A\to\bm{w}_{i:j}}(a)\,W_{B\to CA}(b,c,a)\times s_{O}^{B}(b,k,j)\,s_{I}^{C}(c,k,i-1)+\sum_{B\to AC\in R_{t}}\sum_{b\in N}\sum_{c\in N}W_{A\to\bm{w}_{i:j}}(a)\,W_{B\to AC}(b,a,c)\times s_{O}^{B}(b,i,k)\,s_{I}^{C}(c,j+1,k).$ (11)
$s(A\to BC,i,k,j)=\sum_{a\in N}\sum_{b\in N}\sum_{c\in N}W_{A\to BC}(a,b,c)\times s_{O}^{A}(a,i,j)\times s_{I}^{B}(b,i,k)\times s_{I}^{C}(c,k+1,j).$ (12)
$s_{I}^{A}(\bm{a},i,j)=\sum_{A\to BC\in R_{t}}\iint W_{A\to\bm{w}_{i:j}}(\bm{a})\,W_{A\to BC}(\bm{a},\bm{b},\bm{c})\times s_{I}^{B}(\bm{b},i,k)\,s_{I}^{C}(\bm{c},k+1,j)\,d\bm{b}\,d\bm{c}.$ (13)
$s_{O}^{A}(\bm{a},i,j)=\sum_{B\to CA\in R_{t}}\iint W_{A\to\bm{w}_{i:j}}(\bm{a})\,W_{B\to CA}(\bm{b},\bm{c},\bm{a})\times s_{O}^{B}(\bm{b},k,j)\,s_{I}^{C}(\bm{c},k,i-1)\,d\bm{b}\,d\bm{c}+\sum_{B\to AC\in R_{t}}\iint W_{A\to\bm{w}_{i:j}}(\bm{a})\,W_{B\to AC}(\bm{b},\bm{a},\bm{c})\times s_{O}^{B}(\bm{b},i,k)\,s_{I}^{C}(\bm{c},j+1,k)\,d\bm{b}\,d\bm{c}.$ (14)
$s(A\to BC,i,k,j)=\iiint W_{A\to BC}(\bm{a},\bm{b},\bm{c})\times s_{O}^{A}(\bm{a},i,j)\times s_{I}^{B}(\bm{b},i,k)\times s_{I}^{C}(\bm{c},k+1,j)\,d\bm{a}\,d\bm{b}\,d\bm{c}.$ (15)

Table 2: Experimental results with constituent Tree-LSTMs.

Model	SST-5 Root	SST-5 Phrase	SST-2 Root	SST-2 Phrase
ConTree Le and Zuidema (2015)	49.9	-	88.0	-
ConTree Tai et al. (2015)	51.0	-	88.0	-
ConTree Zhu et al. (2015)	50.1	-	-	-
ConTree Li et al. (2015)	50.4	83.4	86.7	-
ConTree (Our implementation)	51.5	82.8	89.4	86.9
ConTree + WG	51.7	83.0	89.7	88.9
ConTree + LVG4	52.2	83.2	89.8	89.1
ConTree + LVeG	52.9	83.4	89.8	89.5

Table 3: Experimental results with ELMo. BCN(P) is the BCN implemented by Peters et al. (2018). BCN(O) is the BCN implemented by ourselves.

Model	SST-5 Root	SST-5 Phrase	SST-2 Root	SST-2 Phrase
BCN(P)	54.7	-	-	-
BCN(O)	54.6	83.3	91.4	88.8
BCN+WG	55.1	83.5	91.5	90.5
BCN+LVG4	55.5	83.5	91.7	91.3
BCN+LVeG	56.0	83.5	92.1	91.6

Equations

$\begin{bmatrix}\bm{i}\\ \bm{f}_{l}\\ \bm{f}_{r}\\ \bm{o}\\ \bm{g}\end{bmatrix}=\begin{bmatrix}\sigma\\ \sigma\\ \sigma\\ \sigma\\ \tanh\end{bmatrix}\left(\bm{W}_{t}\begin{bmatrix}\bm{x}\\ \bm{h}_{l}\\ \bm{h}_{r}\end{bmatrix}+\bm{b}_{t}\right)$

$\bm{c}_{p}=\bm{i}\otimes\bm{g}+\bm{f}_{l}\otimes\bm{c}_{l}+\bm{f}_{r}\otimes\bm{c}_{r}$

$\bm{h}_{p}=\bm{o}\otimes\tanh(\bm{c}_{p})$

$S(T|\bm{w},K)=\prod_{r_{t}\in T}W_{n}(r_{t})\times\prod_{r_{e}\in T}W_{e}(r_{e})$

$W_{A\shortrightarrow\bm{w}_{i:j}}=\exp\left(f_{A}(\bm{h}_{i:j})\right)$

$W_{r}(\bm{r})=\sum_{k=1}^{K_{r}}\rho_{r,k}\,\mathcal{N}(\bm{r}|\bm{\mu}_{r,k},\bm{\Sigma}_{r,k})$

$\rho_{r,k}$

$\bm{\mu}_{r,k}$

$\bm{\Sigma}_{r,k}$

$T^{*}=\operatorname*{argmax}_{T\in G(\bm{w},K)}P(T|\bm{w},K)$

$P(T|\bm{w},K)=\frac{S(T|\bm{w},K)}{\sum_{\hat{T}\in K}S(\hat{T}|\bm{w},K)}.$

$q(A\shortrightarrow BC,i,k,j)=\frac{s(A\shortrightarrow BC,i,k,j)}{s_{I}(S,1,n)},$

$T_{q}^{*}=\operatorname*{argmax}_{T\in G(\bm{w},K)}\prod_{e\in T}q(e)$

$L(\Theta)=-\log\prod_{i=1}^{m}P_{\Theta}(T_{i}|\bm{w}_{i},K_{i}),$

$\frac{\partial L(\Theta)}{\partial W_{r}}=\sum_{i=1}^{m}\left(\mathbb{E}_{\Theta}[f_{r}(t)|T_{i}]-\mathbb{E}_{\Theta}[f_{r}(t)|\bm{w}_{i}]\right),$
\frac{\partial L ( Θ )}{\partial Θ _{r}} =

\times

Code & Models

Repositories

Ehaschia/bi-tree-lstm-crf (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM · Softmax · ELMo

Full text

Latent Variable Sentiment Grammar

Work was done when the first author was visiting Westlake University. The third author is the corresponding author.

Liwen Zhang†, Kewei Tu†, Yue Zhang‡⋄

†School of Information Science and Technology, ShanghaiTech University, Shanghai, China

‡Institute of Advanced Technology, Westlake Institute for Advanced Study, China

⋄School of Engineering, Westlake University, Hangzhou, China

{zhanglw1,tukw}@shanghaitech.edu.cn

1 Introduction

Determining the sentiment polarity at or below the sentence level is an important task in natural language processing. Sequence structured models Li et al. (2015); McCann et al. (2017) have been exploited for modeling each phrase independently. Recently, tree structured models Zhu et al. (2015); Tai et al. (2015); Teng and Zhang (2017) were leveraged for learning phrase compositions in sentence representation given the syntactic structure. Such models classify the sentiment over each constituent node according to its hidden vector through tree structure encoding.

Though effective, existing neural methods do not consider explicit sentiment compositionality Montague (1974). Take the sentence “The movie is not very good, but I still like it” in Figure 1 as an example Dong et al. (2015). Over the constituent tree, sentiment signals can be propagated from leaf nodes to the root, going through negation, intensification and contrast according to the context. Modeling such signal channels can intuitively lead to more interpretable and reliable results. To model sentiment composition, direct encoding of sentiment signals (e.g., +1/-1 or more fine-grained forms) is necessary.

To this end, we consider a neural network grammar with latent variables. In particular, we employ a grammar as the backbone of our approach in which nonterminals represent sentiment signals and grammar rules specify sentiment compositions. In the simplest version of our approach, nonterminals are sentiment labels from SST directly, resulting in a weighted grammar. To model more fine-grained emotions Ortony and Turner (1990), we consider a latent variable grammar (LVG, Matsuzaki et al. (2005), Petrov et al. (2006)), which splits each nonterminal into subtypes to represent subtle sentiment signals and uses a discrete latent variable to denote the sentiment subtype of a phrase. Finally, inspired by the fact that sentiment can be modeled with a low dimensional continuous space Mehrabian (1980), we introduce a Gaussian mixture latent vector grammar (GM-LVeG, Zhao et al. (2018)), which associates each sentiment signal with a continuous vector instead of a discrete variable.

Experiments on SST show that explicit modeling of sentiment composition leads to significantly improved performance over standard tree encoding, and models that learn subtle emotions as hidden variables give better results than coarse-grained models. Using a bi-attentive classification network Peters et al. (2018) as the encoder, our final model gives the best results on SST. To our knowledge, we are the first to consider neural network grammars with latent variables for sentiment composition. Our code will be released at https://github.com/Ehaschia/bi-tree-lstm-crf.

2 Related Work

Phrase-level sentiment analysis

Li et al. (2015) and McCann et al. (2017) proposed sequence structured models that predict the sentiment polarities of the individual phrases in a sentence independently. Zhu et al. (2015), Le and Zuidema (2015), Tai et al. (2015) and Gupta and Zhang (2018) proposed Tree-LSTM models to capture bottom-up dependencies between constituents for sentiment analysis. In order to support information flow bidirectionally over trees, Teng and Zhang (2017) introduced a Bi-directional Tree-LSTM model that adds a top-down component after Tree-LSTM encoding. These models handle sentiment composition implicitly and predict sentiment polarities only based on embeddings of current nodes. In contrast, we model sentiment explicitly.

Sentiment composition

Moilanen and Pulman (2007) introduced a seminal model for sentiment composition Montague (1974), which composes positive, negative and neutral (+1/-1/0) signals hierarchically. Taboada et al. (2011) proposed a lexicon-based method for addressing sentence-level contextual valence shifting phenomena such as negation and intensification. Choi and Cardie (2008) used a structured linear model to learn semantic compositionality relying on a set of manual features. Dong et al. (2015) developed a statistical parser to learn the sentiment structure of a sentence. Our method is similar in that grammars are used to model semantic compositionality, but we consider neural methods instead of statistical methods for sentiment composition. Teng et al. (2016) proposed a simple weighted-sum model that introduces sentiment lexicon features to an LSTM for sentiment analysis. They used -2 to 2 to represent sentiment polarities. In contrast, we model sentiment subtypes with latent variables and combine the strengths of neural encoders and hierarchical sentiment composition.

Latent Variable Grammar

There has been a line of work using discrete latent variables to enrich coarse-grained constituent labels in phrase-structure parsing Johnson (1998); Matsuzaki et al. (2005); Petrov et al. (2006); Petrov and Klein (2007). Our work is similar in that discrete latent variables are used to model sentiment polarities. To our knowledge, we are the first to consider modeling fine-grained sentiment signals by investigating different types of latent variables. Recently, there has been work using continuous latent vectors for modeling syntactic categories Zhao et al. (2018). We also consider their grammar for modeling sentiment polarities.

3 Baseline

We take the constituent Tree-LSTM as our baseline, which extends sequential LSTM to tree-structured network topologies. Formally, our model computes a parent representation from its two children in a Tree-LSTM:

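$\begin{bmatrix}\bm{i}\\ \bm{f}_{l}\\ \bm{f}_{r}\\ \bm{o}\\ \bm{g}\end{bmatrix}=\begin{bmatrix}\sigma\\ \sigma\\ \sigma\\ \sigma\\ \tanh\end{bmatrix}\left(\bm{W}_{t}\begin{bmatrix}\bm{x}\\ \bm{h}_{l}\\ \bm{h}_{r}\end{bmatrix}+\bm{b}_{t}\right)$ (1)

$\bm{c}_{p}=\bm{i}\otimes\bm{g}+\bm{f}_{l}\otimes\bm{c}_{l}+\bm{f}_{r}\otimes\bm{c}_{r}$ (2)

$\bm{h}_{p}=\bm{o}\otimes\tanh(\bm{c}_{p})$ (3)

where $\bm{W}_{t}\in\mathbb{R}^{5D_{h}\times 3D_{h}}$ and $\bm{b}_{t}\in\mathbb{R}^{5D_{h}}$ are trainable parameters (the bias has one entry for each of the $5D_{h}$ stacked gate units), $\otimes$ is the Hadamard product and $\bm{x}$ represents the input of a leaf node. Our formulation is a special case of the $N$-ary Tree-LSTM Tai et al. (2015) with $N=2$.

Existing work (Tai et al. (2015), Zhu et al. (2015)) performs softmax classification on each node according to the state vector $\bm{h}$ of that node for sentiment analysis. We follow this method in our baseline model.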
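The composition step of the baseline can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the function name `tree_lstm_compose` is hypothetical, the input $\bm{x}$ is taken to have the same size $D$ as the hidden state and to be all zeros at internal nodes, and the bias is taken to have $5D$ entries so that it matches the five stacked gates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tree_lstm_compose(W, b, x, left, right):
    """Compute the parent state (h_p, c_p) from two child states.

    W: 5*D rows with 3*D entries each; b: 5*D entries (assumption of this sketch).
    x: input vector of length D; left, right: (h, c) pairs of length-D vectors.
    """
    (h_l, c_l), (h_r, c_r) = left, right
    D = len(h_l)
    inp = list(x) + list(h_l) + list(h_r)            # concatenation [x; h_l; h_r]
    pre = [sum(w * v for w, v in zip(row, inp)) + b_k for row, b_k in zip(W, b)]
    i = [sigmoid(z) for z in pre[0:D]]               # input gate
    f_l = [sigmoid(z) for z in pre[D:2 * D]]         # forget gate, left child
    f_r = [sigmoid(z) for z in pre[2 * D:3 * D]]     # forget gate, right child
    o = [sigmoid(z) for z in pre[3 * D:4 * D]]       # output gate
    g = [math.tanh(z) for z in pre[4 * D:5 * D]]     # candidate cell values
    c_p = [i[k] * g[k] + f_l[k] * c_l[k] + f_r[k] * c_r[k] for k in range(D)]
    h_p = [o[k] * math.tanh(c_p[k]) for k in range(D)]
    return h_p, c_p


# With all-zero parameters every sigmoid gate is 0.5 and g is 0,
# so the parent cell is simply the average of the two child cells.
D = 2
W_zero = [[0.0] * (3 * D) for _ in range(5 * D)]
b_zero = [0.0] * (5 * D)
h_p, c_p = tree_lstm_compose(W_zero, b_zero, [0.0] * D,
                             ([0.1, 0.2], [1.0, -1.0]),
                             ([0.3, 0.4], [3.0, 1.0]))
```

In a trained model the node classifier would then apply a softmax layer to `h_p`; that layer is omitted here.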

4 Sentiment Grammars

We investigate sentiment grammars as a layer of structured representation on top of a tree-LSTM, which models the correlation between different sentiment labels over a tree. Depending on how a sentiment label is represented, we develop three increasingly complex models. In particular, the first model, which is introduced in Section 4.1, uses a weighted grammar to model the first-order correlation between sentiment labels in a tree. It can be regarded as a PCFG model. The second model, which is introduced in Section 4.2, introduces a discrete latent variable for a refined representation of sentiment classes. Finally, the third model, which is introduced in Section 4.3, considers a continuous latent representation of sentiment classes.

4.1 Weighted Grammars

Formally, a sentiment grammar is defined as $\mathcal{G}=(N,S,\Sigma,R_{t},R_{e},W_{t},W_{e})$, where $N=\{A,B,C,...\}$ is a finite set of sentiment polarities, $S\in N$ is the start symbol, $\Sigma$ is a finite set of terminal symbols representing words such that $N\cap\Sigma=\varnothing$, $R_{t}$ is the transition rule set containing production rules of the form $X\shortrightarrow\alpha$ where $X\in N$ and $\alpha\in N^{+}$; $R_{e}$ is the emission rule set containing production rules of the form $X\shortrightarrow\bm{w}$ where $X\in N$ and $\bm{w}\in\Sigma^{+}$. $W_{t}$ and $W_{e}$ are sets of weights indexed by production rules in $R_{t}$ and $R_{e}$, respectively. Different from standard formal grammars, for each sentiment polarity in a parse tree our sentiment grammar invokes one emission rule to generate a string of terminals and invokes zero or one transition rule to produce its child sentiment polarities. This is similar to the behavior of hidden Markov models. Therefore, in a parse tree each non-leaf node is a sentiment polarity and is connected to exactly one leaf node which is a string of terminals. The terminals that are connected to the parent node can be obtained by concatenating the leaf nodes of its child nodes. Figure 2 shows an example of our sentiment grammar. In this paper, we only consider $R_{t}$ in the Chomsky normal form (CNF) for clarity of presentation. However, it is straightforward to extend our formulation to the general case.

The score of a sentiment tree $T$ conditioned on a sentence $\bm{w}$ is defined as follows:

$S(T|\bm{w},K)=\prod_{r_{t}\in T}W_{n}(r_{t})\times\prod_{r_{e}\in T}W_{e}(r_{e})$ (4)

where $r_{t}$ and $r_{e}$ represent a transition rule and an emission rule in sentiment parse tree $T$ , respectively. We specify the transition weights $W_{n}$ with a non-negative rank-3 tensor. We compute the non-negative weight of each emission rule $W_{e}(X\shortrightarrow\bm{w}_{i:j})$ by applying a single layer perceptron $f_{X}$ and an exponential function to the neural encoder state vector $\bm{h}_{i:j}$ representing the constituent $\bm{w}_{i:j}$ .

Sentiment grammars provide a principled way to model sentiment composition explicitly, and by parameterizing the emission rules with neural encoders, they can take advantage of deep learning. In particular, by adding a weighted grammar on top of a tree-LSTM, our model is reminiscent of LSTM-CRF in the sequence structure.
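As a rough illustration of the weighted-grammar score, the sketch below multiplies one emission weight per node and one transition weight per internal node of a labeled tree. The two-polarity grammar, the numbers in `W_t`, and the per-node `scores` (standing in for the perceptron outputs $f_{X}(\bm{h}_{i:j})$) are all made up for the example.

```python
import math

# Hypothetical toy grammar with two polarities: 0 = negative, 1 = positive.
# W_t[A][B][C] is the non-negative weight of the transition rule A -> B C.
W_t = [
    [[1.0, 0.2], [0.2, 0.1]],
    [[0.1, 0.5], [0.5, 2.0]],
]


def emission_weight(scores, label):
    """W_e(X -> w_{i:j}) = exp(f_X(h_{i:j})); `scores` plays the role of f(h)."""
    return math.exp(scores[label])


def tree_score(node):
    """Product of all emission and transition weights in a labeled tree."""
    label, scores, children = node
    score = emission_weight(scores, label)
    if children is not None:
        left, right = children
        score *= W_t[label][left[0]][right[0]]
        score *= tree_score(left) * tree_score(right)
    return score


# A node is (label, scores, children); leaves have children = None.
leaf_a = (0, [0.5, -0.5], None)
leaf_b = (1, [-1.0, 1.0], None)
root = (1, [0.0, 0.3], (leaf_a, leaf_b))
score = tree_score(root)
```

Here the score is the product of three emission weights, `exp(0.3)`, `exp(0.5)` and `exp(1.0)`, and the single transition weight `W_t[1][0][1]`.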

4.2 Latent Variable Grammars

Inspired by categorical models Ortony and Turner (1990) which regard emotions as an overlay over a series of basic emotions, we extend our sentiment grammars with Latent Variable Grammars (LVGs; Petrov et al. (2006)), which refine each constituent tree node with a discrete latent variable, splitting each observed sentiment polarity into a finite number of unobserved sentiment subtypes. We refer to trees over unsplit sentiment polarities as unrefined trees and trees over sentiment subtypes as refined trees.

Suppose that the sentiment polarities $A$ , $B$ and $C$ of a transition rule $A\shortrightarrow BC$ are split into $n_{A}$ , $n_{B}$ and $n_{C}$ subtypes, respectively. The weights of the refined transition rule can be represented by a non-negative rank-3 tensor $W_{A\shortrightarrow BC}\in\mathbb{R}^{n_{A}\times n_{B}\times n_{C}}$ . Similarly, given an emission rule $A\shortrightarrow\bm{w}_{i:j}$ , the weights of its refined rules by splitting $A$ into $n_{A}$ subtypes is a non-negative vector $W_{A\shortrightarrow\bm{w}_{i:j}}\in\mathbb{R}^{n_{A}}$ calculated by an exponential function and a single layer perceptron $f_{A}$ :

$W_{A\shortrightarrow\bm{w}_{i:j}}=\exp\left(f_{A}(\bm{h}_{i:j})\right)$ (5)

where $\bm{h}_{i:j}$ is the vector representation of constituent $\bm{w}_{i:j}$ . The score of a refined parse tree is defined as the product of weights of all transition rules and emission rules that make up the refined parse tree, similar to Equation 4. The score of an unrefined parse tree is then defined as the sum of the scores of all refined trees that are consistent with it.

Note that Weighted Grammar (WG) can be viewed as a special case of LVGs where each sentiment polarity has one subtype.
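The sum over refined trees does not have to be enumerated explicitly: for a fixed unrefined tree it can be accumulated bottom-up, one vector of subtype scores per node. The sketch below assumes a toy tree with two subtypes per polarity and made-up weights; the names `refined_inside` and `unrefined_score` are hypothetical, and the brute-force sum is included only as a cross-check.

```python
import itertools


def refined_inside(node, W_t):
    """Return v, where v[a] is the total weight of all refined subtrees
    rooted at `node` whose root carries subtype a."""
    label, emit, children = node          # emit[a] = W_{A -> w_{i:j}}(a)
    if children is None:
        return list(emit)
    left, right = children
    v_l = refined_inside(left, W_t)
    v_r = refined_inside(right, W_t)
    T = W_t[(label, left[0], right[0])]   # T[a][b][c] = W_{A -> B C}(a, b, c)
    return [emit[a] * sum(T[a][b][c] * v_l[b] * v_r[c]
                          for b in range(len(v_l)) for c in range(len(v_r)))
            for a in range(len(emit))]


def unrefined_score(root, W_t):
    """Score of an unrefined tree: sum over all consistent refined trees."""
    return sum(refined_inside(root, W_t))


POS, NEG = "pos", "neg"
leaf_b = (NEG, [0.5, 1.5], None)
leaf_c = (POS, [2.0, 1.0], None)
root = (POS, [1.0, 2.0], (leaf_b, leaf_c))
W_t = {(POS, NEG, POS): [[[1.0, 0.0], [0.0, 1.0]],
                         [[0.5, 0.5], [0.5, 0.5]]]}

# Cross-check: enumerate every subtype assignment (a, b, c) of the three nodes.
brute = sum(root[1][a] * W_t[(POS, NEG, POS)][a][b][c] * leaf_b[1][b] * leaf_c[1][c]
            for a, b, c in itertools.product(range(2), repeat=3))
```

With a single subtype per polarity every vector has length one and the recursion reduces to the plain product of rule weights, which is the weighted-grammar case.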

4.3 Gaussian Mixture Latent Vector Grammars

Inspired by continuous models Mehrabian (1980) which model emotions in a continuous low dimensional space, we employ Latent Vector Grammars (LVeGs) Zhao et al. (2018) that associate each sentiment polarity with a latent vector space representing the set of sentiment subtypes. We follow the idea of Gaussian Mixture LVeGs (GM-LVeGs) Zhao et al. (2018), which use Gaussian mixtures to model weight functions. Because Gaussian mixtures have the nice property of being closed under product, summation, and marginalization, learning and parsing can be done efficiently using dynamic programming.

In GM-LVeG, the weight function of a transition or emission rule $r$ is defined as a Gaussian mixture with $K_{r}$ mixture components:

$W_{r}(\bm{r})=\sum_{k=1}^{K_{r}}\rho_{r,k}\,\mathcal{N}(\bm{r}|\bm{\mu}_{r,k},\bm{\Sigma}_{r,k})$ (6)

where $\bm{r}$ is the concatenation of the latent vectors representing subtypes for sentiment polarities in rule $r$ , $\rho_{r,k}>0$ is the $k$ -th mixing weight (the $K_{r}$ mixture weights do not necessarily sum up to 1), and $\mathcal{N}(\bm{r}|\bm{\mu}_{r,k},\bm{\Sigma}_{r,k})$ denotes the $k$ -th Gaussian distribution parameterized by mean $\bm{\mu}_{r,k}$ and co-variance matrix $\bm{\Sigma}_{r,k}$ . For an emission rule $A\shortrightarrow\bm{w}_{i:j}$ , all the Gaussian mixture parameters are calculated by single layer perceptrons from the vector representation $\bm{h}_{i:j}$ of constituent $\bm{w}_{i:j}$ :

[TABLE]

For the sake of computational efficiency, we use Gaussian distributions with diagonal co-variance matrices.
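The closure property that makes the dynamic program tractable can be checked numerically in one dimension; with diagonal covariance matrices the same identity applies independently to each dimension. The sketch below uses the standard identity for the product of two Gaussian densities. The function names are hypothetical and the example is not taken from the authors' implementation.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_product(m1, v1, m2, v2):
    """Return (scale, mean, var) such that
    N(x | m1, v1) * N(x | m2, v2) = scale * N(x | mean, var) for every x."""
    var = 1.0 / (1.0 / v1 + 1.0 / v2)
    mean = var * (m1 / v1 + m2 / v2)
    scale = normal_pdf(m1, m2, v1 + v2)
    return scale, mean, var


scale, mean, var = gaussian_product(0.0, 1.0, 2.0, 0.5)
x = 0.3
lhs = normal_pdf(x, 0.0, 1.0) * normal_pdf(x, 2.0, 0.5)
rhs = scale * normal_pdf(x, mean, var)
```

Because the product is again a scaled Gaussian, integrating it over `x` leaves just `scale`, which is why the integrals over latent vectors in the inside and outside scores stay in closed form.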

4.4 Parsing

The goal of our task is to find the most probable sentiment parse tree $T^{*}$ , given a sentence $\bm{w}$ and its constituency parse tree skeleton $K$ . The polarity of the root node represents the polarity of the whole sentence, and the polarity of a constituent node is considered as the polarity of the phrase spanned by the node. Formally, $T^{*}$ is defined as:

$T^{*}=\operatorname*{argmax}_{T\in G(\bm{w},K)}P(T|\bm{w},K)$

where $G(\bm{w},K)$ denotes the set of unrefined sentiment parse trees for $\bm{w}$ with skeleton $K$. $P(T|\bm{w},K)$ is defined based on the parse tree score Equation 4:

$P(T|\bm{w},K)=\frac{S(T|\bm{w},K)}{\sum_{\hat{T}\in K}S(\hat{T}|\bm{w},K)}.$

Note that unlike syntactic parsing, on SST we do not need to consider structural ambiguity, and thus only need to resolve rule ambiguity.

$T^{*}$ can be found using dynamic programming such as the CYK algorithm for WG. However, parsing becomes intractable with LVGs and LVeGs since we have to compute the score of an unrefined parse tree by summing over all of its refined versions. We use the best performing max-rule-product decoding algorithm Petrov et al. (2006); Petrov and Klein (2007) for approximate parsing, which searches for the parse tree that maximizes the product of the posteriors (or expected counts) of unrefined rules in the parse tree. The detailed procedure is described below, which is based on the classic inside-outside algorithm.

For LVGs, we first use dynamic programming to recursively calculate the inside score function $s_{\textbf{I}}^{A}(a,i,j)$ and outside score function $s_{\textbf{O}}^{A}(a,i,j)$ for each sentiment polarity over each span $\bm{w}_{i:j}$ consistent with skeleton $K$ using Equation 10 and Equation 11 in Table 1, respectively. Similarly, for LVeGs, the inside score function $s_{\textbf{I}}^{A}(\bm{a},i,j)$ and outside score function $s_{\textbf{O}}^{A}(\bm{a},i,j)$ are calculated recursively by Equation 13 and Equation 14 in Table 1, in which we replace the sum over discrete variables in Equations 10-11 with the integral over continuous vectors. Next, using Equation 12 and Equation 15 in Table 1, we calculate the score $s(A\shortrightarrow BC,i,k,j)$ for LVG and LVeG, respectively, where $\left\langle{A\shortrightarrow BC,i,k,j}\right\rangle$ represents an anchored transition rule $A\shortrightarrow BC$ with $A$, $B$ and $C$ spanning phrases $\bm{w}_{i:j}$, $\bm{w}_{i:k}$ and $\bm{w}_{k+1:j}$ (all being consistent with skeleton $K$), respectively. The posterior (or expected count) of $\left\langle{A\shortrightarrow BC,i,k,j}\right\rangle$ can be calculated as follows:

$q(A\shortrightarrow BC,i,k,j)=\frac{s(A\shortrightarrow BC,i,k,j)}{s_{\textbf{I}}(S,1,n)},$

where $s_{\textbf{I}}(S,1,n)$ is the inside score for the start symbol $S$ over the whole sentence $\bm{w}_{1:n}$. Then we can run the CYK algorithm to identify the parse tree that maximizes the product of rule posteriors. Its objective function is given by:

$T_{q}^{*}=\operatorname*{argmax}_{T\in G(\bm{w},K)}\prod_{e\in T}q(e)$

where $e$ ranges over all the transition rules in the sentiment parse tree $T$ .

Note that the equations in Table 1 are tailored for our sentiment grammars and differ from their standard versions in two aspects. First, we take into account the additional emission rules in the inside and outside computation; second, the parse tree skeleton is assumed given and hence the split point $k$ is fixed in advance in all the equations.
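To make the inside-outside pass on a fixed skeleton concrete, the sketch below computes rule posteriors for a toy LVG with two polarities and two subtypes each. It is a minimal sketch under several assumptions that are not stated in the text: weights are random positive numbers, any polarity may label the root, the outside score of the root is taken to be its emission weight (following the convention in Table 1 that outside scores include the node's own emission), and the normalizer is the total inside score at the root. The names `Node`, `inside`, and `outside_and_posteriors` are hypothetical.

```python
import itertools
import random

P, N_SUB = 2, 2                      # toy sizes: 2 polarities, 2 subtypes each
rng = random.Random(0)
RULES = list(itertools.product(range(P), repeat=3))   # every rule A -> B C


def rand_tensor(*shape):
    """Nested list of positive weights with the given shape."""
    if len(shape) == 1:
        return [rng.uniform(0.1, 1.0) for _ in range(shape[0])]
    return [rand_tensor(*shape[1:]) for _ in range(shape[0])]


W = {rule: rand_tensor(N_SUB, N_SUB, N_SUB) for rule in RULES}   # W[(A,B,C)][a][b][c]


class Node:
    """One node of the given skeleton; leaves have no children."""

    def __init__(self, left=None, right=None):
        self.left, self.right = left, right
        self.emit = rand_tensor(P, N_SUB)   # emit[A][a], stands in for exp(f_A(h))
        self.inside = None
        self.outside = None
        self.q = {}                         # rule posteriors at this node


def inside(node):
    """Bottom-up inside scores in the spirit of Equation 10."""
    if node.left is None:
        node.inside = [row[:] for row in node.emit]
        return
    inside(node.left)
    inside(node.right)
    L, R = node.left.inside, node.right.inside
    node.inside = [[node.emit[A][a] * sum(W[(A, B, C)][a][b][c] * L[B][b] * R[C][c]
                                          for B in range(P) for C in range(P)
                                          for b in range(N_SUB) for c in range(N_SUB))
                    for a in range(N_SUB)] for A in range(P)]


def outside_and_posteriors(node, Z):
    """Top-down outside scores and rule posteriors (Equations 11 and 12, then q)."""
    if node.left is None:
        return
    O, L, R = node.outside, node.left.inside, node.right.inside
    idx = range(N_SUB)
    for (A, B, C) in RULES:
        s = sum(W[(A, B, C)][a][b][c] * O[A][a] * L[B][b] * R[C][c]
                for a in idx for b in idx for c in idx)
        node.q[(A, B, C)] = s / Z
    node.left.outside = [[node.left.emit[B][b] * sum(
        W[(A, B, C)][a][b][c] * O[A][a] * R[C][c]
        for A in range(P) for C in range(P) for a in idx for c in idx)
        for b in idx] for B in range(P)]
    node.right.outside = [[node.right.emit[C][c] * sum(
        W[(A, B, C)][a][b][c] * O[A][a] * L[B][b]
        for A in range(P) for B in range(P) for a in idx for b in idx)
        for c in idx] for C in range(P)]
    outside_and_posteriors(node.left, Z)
    outside_and_posteriors(node.right, Z)


# Skeleton ((w1 w2) w3): two internal nodes, three leaves.
inner = Node(Node(), Node())
root = Node(inner, Node())
inside(root)
Z = sum(sum(row) for row in root.inside)
root.outside = [row[:] for row in root.emit]
outside_and_posteriors(root, Z)
```

At every internal node the posteriors over the unrefined rules sum to one, which is a useful sanity check. The decoding step itself, which picks the labeling that maximizes the product of these posteriors, is left out of the sketch.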

4.5 Learning

We are given a training dataset $D=\{T_{i},\bm{w}_{i},K_{i}|i=1\dots m\}$ containing $m$ samples, where $T_{i}$ is the gold sentiment parse tree for the sentence $\bm{w}_{i}$ with its corresponding gold tree skeleton $K_{i}$. The discriminative learning objective is to minimize the negative log conditional likelihood:

$L(\Theta)=-\log\prod_{i=1}^{m}P_{\Theta}(T_{i}|\bm{w}_{i},K_{i}),$

where $\Theta$ represents the set of trainable parameters of our models. We optimize the objective with gradient-based methods. In particular, gradients are first calculated over the sentiment grammar layer, before being back-propagated to the tree-LSTM layer.
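Substituting the definition of the conditional probability from Section 4.4 into the objective gives a per-sample decomposition. The expansion below is a routine rewriting rather than a formula taken from the paper; the inner sum is understood to range over all candidate sentiment trees for sentence $\bm{w}_{i}$ under skeleton $K_{i}$.

```latex
\begin{align*}
L(\Theta) &= -\sum_{i=1}^{m}\log P_{\Theta}(T_{i}\mid\bm{w}_{i},K_{i}) \\
          &= -\sum_{i=1}^{m}\Big[\log S(T_{i}\mid\bm{w}_{i},K_{i})
             - \log\sum_{\hat{T}} S(\hat{T}\mid\bm{w}_{i},K_{i})\Big]
\end{align*}
```

The first term rewards the gold tree and the second is the log normalizer. For LVG and LVeG both terms involve sums (or integrals) over subtypes, which is presumably why the inside scores of Section 4.4 are reused during training.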

The gradient computation for the three models involves computing expected counts of rules, which has been described in Section 4.4. For WG and LVG, the derivative of $W_{r}$ , the parameter of an unrefined production rule $r$ is:

$\frac{\partial L(\Theta)}{\partial W_{r}}=\sum_{i=1}^{m}\left(\mathbb{E}_{\Theta}[f_{r}(t)|T_{i}]-\mathbb{E}_{\Theta}[f_{r}(t)|\bm{w}_{i}]\right),$

where $\mathbb{E}_{\Theta}[f_{r}(t)|T_{i}]$ denotes the expected count of the unrefined production rule $r$ with respect to $P_{\Theta}$ in the set of refined trees $t$ that are consistent with the observed parse tree $T_{i}$. Similarly, we use $\mathbb{E}_{\Theta}[f_{r}(t)|\bm{w}_{i}]$ for the expectation over all derivations of the sentence $\bm{w}_{i}$.

For LVeG, the derivative with respect to $\Theta_{r}$ , the parameters of the weight function $W_{r}(\bm{r})$ of an unrefined production rule $r$ is:

[TABLE]

The two expectations in Equation 19 and 20 can be efficiently computed using the inside-outside algorithm in Table 1. The derivative of the parameters of neural encoder can be derived from the derivative of the parameters of the emission rules.

5 Experiments

To investigate the effectiveness of modeling sentiment composition explicitly and using discrete variables or continuous vectors to model sentiment subtypes, we compare standard constituent Tree-LSTM (ConTree) with our models ConTree+WG, ConTree+LVG and ConTree+LVeG, respectively. To show the universality of our approaches, we also experiment with the combination of a state-of-the-art sequence structured model, bi-attentive classification network (BCN, Peters et al. (2018)), with our model: BCN+WG, BCN+LVG and BCN+LVeG.

5.1 Data

We use Stanford Sentiment TreeBank (SST, Socher et al. (2013)) for our experiments. Each constituent node in a phrase-structured tree is manually assigned an integer sentiment polarity from 0 to 4, which correspond to five sentiment classes: very negative, negative, neutral, positive and very positive, respectively. The root label represents the sentiment label of the whole sentence. The constituent node label represents the sentiment label of the phrase it spans. We perform both binary classification (-1, 1) and fine-grained classification (0-4), called SST-2 and SST-5, respectively. Following previous work, we use the labels of all phrases and gold-standard tree structures for training and testing. For binary classification, we merge all positive labels and negative labels.
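The label merging for the binary task can be written as a small helper. The text does not say how neutral nodes are treated in SST-2; the sketch below assumes they are skipped, which is a common convention but an assumption here, and the function name `sst5_to_binary` is hypothetical.

```python
from typing import Optional


def sst5_to_binary(label: int) -> Optional[int]:
    """Map an SST-5 label (0-4) to -1 or 1.

    Returns None for the neutral label 2 (assumption: neutral nodes are
    excluded from the binary task).
    """
    if label not in range(5):
        raise ValueError(f"SST-5 labels must be in 0..4, got {label}")
    if label < 2:
        return -1        # very negative, negative
    if label > 2:
        return 1         # positive, very positive
    return None          # neutral


binary_labels = [sst5_to_binary(y) for y in range(5)]
```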

5.2 Experimental Settings

Hyper-parameters

For ConTree, word vectors are initialized using GloVe Pennington et al. (2014) 300-dimensional embeddings and are updated together with other parameters. We set the hidden size to 300. Adam Kingma and Ba (2014) is used to optimize the parameters with a learning rate of 0.001. We adopt Dropout after the Embedding layer with a probability of 0.5. The sentence-level mini-batch size is 32. For the BCN experiments, we follow the model settings of McCann et al. (2017), except that the sentence-level mini-batch size is set to 8.
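For reference, the ConTree settings listed above can be collected in a small configuration object. The class and field names are hypothetical and only the values come from the text; the remaining BCN settings follow McCann et al. (2017) and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyper-parameters reported for the ConTree experiments."""
    embedding_dim: int = 300          # GloVe vectors, fine-tuned during training
    hidden_size: int = 300
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    embedding_dropout: float = 0.5    # applied after the embedding layer
    batch_size: int = 32              # sentences per mini-batch


CONTREE = TrainConfig()
BCN_BATCH_SIZE = 8                    # the only BCN setting stated in the text
```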

5.3 Development Experiments

We use the SST development dataset to investigate different configurations of our latent variables and Gaussian mixtures. The best performing parameters on the development set are used in all following experiments.

LVG subtype numbers

To explore the suitable number of latent variables to model subtypes of a sentiment polarity, we evaluate our ConTree+LVG model with different numbers of latent variables from 1 to 8. Figure 6(a) shows that there is an upward trend while the number of hidden variables $n$ increases from 1 to 4. After reaching the peak when $n=4$, the accuracy decreases as the number of latent variables continues to increase. We thus choose $n=4$ for the remaining experiments.

LVeG Gaussian dimensions

We investigate the influence of the latent vector dimension on the accuracy for ConTree+LVeG. The component number of Gaussian mixtures is fixed to 1. Figure 6(b) shows that as the dimension increases from 1 to 8, the accuracy rises from dimension 1 to 2 and then decreases from 2 to 8. Thus we set the Gaussian dimension to 2 for the remaining experiments.

LVeG Gaussian mixture component numbers

Figure 6(c) shows the performance of different component numbers with the Gaussian dimension fixed to 2. As the number of Gaussian components increases, the fine-grained sentence-level accuracy declines slowly. The best performance is obtained when the component number $K_{r}=1$, which we choose for the remaining experiments.

5.4 Main Results

We re-implement constituent Tree-LSTM (ConTree) of Tai et al. (2015) and obtain better results than their original implementation. We then integrate ConTree with Weighted Grammars (ConTree+WG), Latent Variable Grammars with a subtype number of 4 (ConTree+LVG4), and Latent Vector Grammars (ConTree+LVeG), respectively. Table 2 shows the experimental results for sentiment classification on both SST-5 and SST-2 at the sentence level (Root) and all nodes (Phrase).

The performance improvement of ConTree+WG over ConTree reflects the benefit of handling sentiment composition explicitly. In particular, on the phrase-level binary classification task, ConTree+WG improves the accuracy by 2.0 points (from 86.9 to 88.9).

Compared with ConTree+WG, ConTree+LVG4 improves the fine-grained sentence-level accuracy by 0.5 points, which demonstrates the effectiveness of modeling the sentiment subtypes with discrete variables. Similarly, when Latent Vector Grammar is incorporated into the constituent Tree-LSTM, the performance improvements, especially on the sentence-level SST-5, demonstrate the effectiveness of modeling sentiment subtypes with continuous vectors. The performance improvements of ConTree+LVeG over ConTree+LVG4 show the advantage of infinite subtypes over finite subtypes.

There has also been work using large-scale external datasets to improve the performance of sentiment classification. Peters et al. (2018) combined the bi-attentive classification network (BCN, McCann et al. (2017)) with a language model with character convolutions pretrained on a large-scale corpus (ELMo) and reported an accuracy of 54.7 on sentence-level SST-5. For a fair comparison, we also augment our model with ELMo. Table 3 shows that our methods beat the baseline on every task. BCN+WG improves accuracy slightly on all tasks by modeling sentiment composition explicitly. The clear improvements of BCN+LVG4 and BCN+LVeG show that explicitly modeling sentiment composition with fine-grained sentiment subtypes is useful. In particular, BCN+LVeG improves the sentence-level classification accuracy by 1.4 points (fine-grained) and 0.7 points (binary) compared to BCN (our implementation). To our knowledge, we achieve the best results on the SST dataset.
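The gains quoted in this section can be recomputed directly from Tables 2 and 3. The snippet below copies the relevant rows and takes differences; the dictionary layout and the helper `gain` are just one convenient way to organize the numbers.

```python
# Columns: (SST-5 Root, SST-5 Phrase, SST-2 Root, SST-2 Phrase), copied from the tables.
table2 = {
    "ConTree":      (51.5, 82.8, 89.4, 86.9),
    "ConTree+WG":   (51.7, 83.0, 89.7, 88.9),
    "ConTree+LVG4": (52.2, 83.2, 89.8, 89.1),
    "ConTree+LVeG": (52.9, 83.4, 89.8, 89.5),
}
table3 = {
    "BCN(O)":   (54.6, 83.3, 91.4, 88.8),
    "BCN+WG":   (55.1, 83.5, 91.5, 90.5),
    "BCN+LVG4": (55.5, 83.5, 91.7, 91.3),
    "BCN+LVeG": (56.0, 83.5, 92.1, 91.6),
}
BCN_P_SST5_ROOT = 54.7   # Peters et al. (2018), the only BCN(P) number reported


def gain(table, model, baseline, column):
    """Accuracy difference in points, rounded to one decimal place."""
    return round(table[model][column] - table[baseline][column], 1)


wg_phrase_binary = gain(table2, "ConTree+WG", "ConTree", 3)
lvg_over_wg_root = gain(table2, "ConTree+LVG4", "ConTree+WG", 0)
lveg_fine = gain(table3, "BCN+LVeG", "BCN(O)", 0)
lveg_binary = gain(table3, "BCN+LVeG", "BCN(O)", 2)
over_previous_best = round(table3["BCN+LVeG"][0] - BCN_P_SST5_ROOT, 1)
```

The differences come out to 2.0, 0.5, 1.4 and 0.7 points, matching the text, and the margin over the BCN(P) result is 1.3 points.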

5.5 Analysis

We further analyze our methods based on the constituent Tree-LSTM model. In the following, WG, LVG and LVeG denote our three methods, respectively.

Impact on words and phrases

Figure 7 shows the accuracy improvements over ConTree on phrases of different heights. Here the height $h$ of a phrase in a parse tree is defined as the distance between its corresponding constituent node and the deepest leaf node in its subtree. The improvement of our methods on word nodes, whose height is 0, is small because neural networks and word embeddings can already capture the emotion of words. In fact, the accuracy of ConTree on word nodes reaches 98.1%. As the height increases, the performance of our methods increases, except for the accuracy of WG when $h\geq 10$, since the coarse-grained sentiment representation has far more difficulty handling many sentiment compositions over the tree structure. The performance improvements of LVG4 and LVeG when $h\geq 10$ show that modeling fine-grained sentiment signals can better represent the sentiment of higher phrases.
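The height used to group phrases follows directly from its definition. In the sketch below a tree is a nested tuple with words as strings; this representation and the function name `height` are choices made for the example only.

```python
def height(node):
    """Distance from a node to the deepest leaf in its subtree.

    A word (string) is a leaf with height 0; an internal node is a tuple
    of children.
    """
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node)


phrase = ("not", ("very", "good"))
sentence = (("The", "movie"), ("is", phrase))
heights = [height("good"), height(("very", "good")), height(phrase), height(sentence)]
```

Grouping node-level accuracies by this value gives the kind of breakdown by height that the analysis describes.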

Impact on sentiment polarities

Figure 8 shows the performance changes of our models over ConTree on different sentiment polarities. WG improves the accuracy of every sentiment polarity slightly over ConTree. Compared with ConTree, the accuracies of LVG4 and LVeG on extreme sentiments (the strong negative and strong positive sentiments) improve significantly. In addition, the proportion of extreme sentiments misclassified as weak sentiments (the negative and positive sentiments) drops dramatically. This indicates that LVG4 and LVeG can capture the subtle difference between extreme sentiments and weak sentiments by modeling sentiment subtypes explicitly.
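The two quantities discussed above, per-polarity accuracy and the share of extreme sentiments predicted as weak ones, could be computed as in the following sketch. It assumes the common SST-5 integer encoding (0 = strong negative, 1 = negative, 2 = neutral, 3 = positive, 4 = strong positive) and counts only confusions with the weak class of the same polarity; both are assumptions made here, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence

# Assumed mapping from each extreme class to the weak class of the same polarity.
EXTREME_TO_WEAK = {0: 1, 4: 3}


def per_class_accuracy(gold: Sequence[int], pred: Sequence[int]) -> Dict[int, float]:
    """Accuracy computed separately for each gold sentiment class."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {c: correct[c] / total[c] for c in total}


def extreme_to_weak_rate(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of gold-extreme items predicted as the same-polarity weak class."""
    extreme = [(g, p) for g, p in zip(gold, pred) if g in EXTREME_TO_WEAK]
    if not extreme:
        return 0.0
    confused = sum(1 for g, p in extreme if p == EXTREME_TO_WEAK[g])
    return confused / len(extreme)


# Made-up labels purely to exercise the functions.
gold = [0, 0, 4, 4, 2]
pred = [0, 1, 3, 4, 2]
acc = per_class_accuracy(gold, pred)
rate = extreme_to_weak_rate(gold, pred)
```

A drop in `extreme_to_weak_rate` between two models on the same gold labels corresponds to the effect described for LVG4 and LVeG.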

Visualization of sentiment subtypes

To investigate whether LVeG can accurately model different sentiment subtypes, we visualize in a 2D space all correctly classified strong negative phrases with length below 6. Since LVeG uses 2-dimensional, 1-component Gaussian mixtures to model the distribution over subtypes of a specific sentiment of a phrase, we directly represent each phrase by its Gaussian mean $\bm{\mu}$. From Figure 9, we see that phrases expressing boredom such as “Extremely boring” and “boring” (green dots) are located at the bottom left, phrases expressing stupidity such as “stupider” and “Ridiculous” (red dots) are mainly located at the top right, and negative phrases with no particular emotional tendency such as “hate” and “bad” (blue dots) are evenly distributed throughout the space. This demonstrates that LVeG can capture sentiment subtypes.
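As a brief aside on why the mean is a natural summary here, consider a generic $K$-component mixture treated as a normalized density. The weights $\rho_k$ and covariances $\bm{\Sigma}_k$ are notation assumed for this sketch, not symbols taken from the text above:

$$p(\bm{a}) = \sum_{k=1}^{K} \rho_k \, \mathcal{N}(\bm{a}; \bm{\mu}_k, \bm{\Sigma}_k), \qquad \sum_{k=1}^{K} \rho_k = 1, \qquad \mathbb{E}[\bm{a}] = \sum_{k=1}^{K} \rho_k \bm{\mu}_k .$$

With $K = 1$ the single weight is $\rho_1 = 1$, so $\mathbb{E}[\bm{a}] = \bm{\mu}_1 = \bm{\mu}$. Because the subtype space is 2-dimensional, $\bm{\mu}$ is already a point in the plane and can be plotted without any further dimensionality reduction.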

6 Conclusion

We presented a range of sentiment grammars that use neural networks to model sentiment composition explicitly, and empirically showed that explicit modeling of sentiment composition with fine-grained sentiment subtypes gives better performance than state-of-the-art neural network models in sentiment analysis. Using ELMo embeddings, our final model improves fine-grained accuracy by 1.3 points compared to the current best result.

Acknowledgments

This work was supported by the Major Program of Science and Technology Commission Shanghai Municipal (17JC1404102) and NSFC (No. 61572245). We would like to thank the anonymous reviewers for their careful reading and useful comments.

Bibliography

1. Yejin Choi and Claire Cardie. 2008. Learning with compositional semantics as structural inference for subsentential sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 793–801. Association for Computational Linguistics.
2. Li Dong, Furu Wei, Shujie Liu, Ming Zhou, and Ke Xu. 2015. A statistical parsing framework for sentiment classification. Computational Linguistics, pages 265–308.
3. Amulya Gupta and Zhu Zhang. 2018. To attend or not to attend: A case study on syntactic structures for semantic relatedness. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2116–2125. Association for Computational Linguistics.
4. Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, pages 613–632.
5. Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations.
6. Phong Le and Willem Zuidema. 2015. Compositional distributional semantics with long short term memory. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 10–19. Association for Computational Linguistics.
7. Jiwei Li, Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2304–2314. Association for Computational Linguistics.
8. Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 75–82, Ann Arbor, Michigan. Association for Computational Linguistics.