Composition of Sentence Embeddings: Lessons from Statistical Relational Learning

Damien Sileo; Tim Van-De-Cruys; Camille Pradel; Philippe Muller

arXiv:1904.02464·cs.CL·April 5, 2019

TL;DR

This paper explores the use of advanced statistical relational learning models for composing sentence embeddings, demonstrating that these models are more expressive and improve performance on relation prediction and sentence representation tasks.

Contribution

It reveals limitations of traditional composition methods in NLP and introduces SRL-based models that enhance expressiveness and accuracy in semantic relation tasks.

Findings

01

SRL-based compositions outperform simple methods in relation prediction

02

Advanced models significantly improve state-of-the-art results

03

Traditional compositions are insufficient for complex NLP tasks

Tables

Table 1: Selected relational learning models. Unstructured is from (Bordes et al., 2013a), TransE from (Bordes et al., 2013b), RESCAL from (Nickel et al., 2011), DistMult from (Yang et al., 2015) and ComplEx from (Trouillon et al., 2016). Following the latter, $<a,b,c>$ denotes $\sum_{k}a_{k}b_{k}c_{k}$. $\text{Re}(x)$ is the real part of $x$, and $p$ is commonly set to $1$.

Model	Scoring function	Parameters
Unstructured	${‖ e_{1} - e_{2} ‖}_{p}$	-
TransE	${‖ e_{1} + w_{r} - e_{2} ‖}_{p}$	$w_{r} \in ℝ^{d}$
RESCAL	$e_{1}^{T} W_{r} e_{2}$	$W_{r} \in ℝ^{d^{2}}$
DistMult	$< e_{1}, w_{r}, e_{2} >$	$w_{r} \in ℝ^{d}$
ComplEx	$Re < e_{1}, w_{r}, \bar{e_{2}} >$	$w_{r} \in ℂ^{d}$

Table 2: Transfer evaluation tasks. N = number of training examples; C = number of classes if applicable. $h_{1},h_{2}$ are sentence representations, $f_{m,s}$ a composition function from section 4.

name	N	task	C	representation(s) used
MR	11k	sentiment (movies)	2	$h_{1}$
SUBJ	10k	subjectivity/objectivity	2	$h_{1}$
MPQA	11k	opinion polarity	2	$h_{1}$
TREC	6k	question-type	6	$h_{1}$
${SICK}_{s}^{m}$	10k	NLI	3	$f_{m, s} (h_{1}, h_{2})$
${MRPC}_{s}^{m}$	4k	paraphrase detection	2	$(f_{m, s} (h_{1}, h_{2}) + f_{m, s} (h_{2}, h_{1})) / 2$
${PDTB}_{s}^{m}$	17k	discursive relation	5	$f_{m, s} (h_{1}, h_{2})$
STS14	4.5k	similarity	-	$cos (h_{1}, h_{2})$

Table 3: SentEval and base task evaluation results for the models trained on natural language inference ($\mathcal{T}=\mathit{NLI}$); AllNLI is used for training. All scores are accuracy percentages, except STS14, which is Pearson correlation percentage. AVG denotes the average of the SentEval scores.

Models trained on natural language inference ( $𝒯 = 𝑁𝐿𝐼$ )
m,s	MR	SUBJ	MPQA	TREC	${MRPC}_{-}^{⊙}$	${PDTB}_{-}^{⊙}$	${SICK}_{-}^{⊙}$	STS14	$𝒯$	AVG
$⊙, -$	81.2	92.7	90.4	89.6	76.1	46.7	86.6	69.5	84.2	79.1
$α, -$	81.4	92.8	90.5	89.6	75.4	46.6	86.7	69.5	84.3	79.1
$β, -$	81.2	92.6	90.5	89.6	76	46.5	86.6	69.5	84.2	79.1
$⊙, t$	81.1	92.7	90.5	89.7	76.5	46.4	86.5	70.0	84.8	79.2
$α, t$	81.3	92.6	90.6	89.2	76.2	47.2	86.5	70.0	84.6	79.2
$β, t$	81.2	92.7	90.4	88.5	75.8	47.3	86.8	69.8	84.2	79.1

Table 4: SentEval and base task evaluation results for the models trained on discourse connective prediction ($\mathcal{T}=\mathit{Disc}$). All scores are accuracy percentages, except STS14, which is Pearson correlation percentage. AVG denotes the average of the SentEval scores.

Models trained on discourse connective prediction ( $𝒯 = 𝐷𝑖𝑠𝑐$ )
m,s	MR	SUBJ	MPQA	TREC	${MRPC}_{-}^{⊙}$	${PDTB}_{-}^{⊙}$	${SICK}_{-}^{⊙}$	STS14	$𝒯$	AVG
$⊙, -$	80.4	92.7	90.2	89.5	74.5	47.3	83.2	57.9	35.7	77
$α, -$	80.4	92.9	90.2	90.2	75	47.9	83.3	57.8	35.9	77.2
$β, -$	80.2	92.8	90.2	88.4	74.9	47.5	82.9	57.7	35.9	76.8
$⊙, t$	80.2	92.8	90.2	90.4	74.6	48.5	83.4	58.6	36.1	77.3
$α, t$	80.2	92.9	90.3	90.3	75.1	47.8	83.2	58.3	36.1	77.3
$β, t$	80.2	92.8	90.3	89.7	74.4	47.9	83.7	58.2	35.7	77.2

Table 5: Comparison models from previous work. InferSent represents the original results from Conneau et al. (2017), SkipT is SkipThought from Kiros et al. (2015), and BoW is our re-evaluation of GloVe Bag of Words from Conneau et al. (2017). AVG denotes the average of the SentEval scores.

model	MR	SUBJ	MPQA	TREC	${MRPC}_{-}^{⊙}$	${PDTB}_{-}^{⊙}$	${SICK}_{-}^{⊙}$	STS14	AVG
Comparison models
Infersent	81.1	92.4	90.2	88.2	76.2	46.7-	86.3	70	78.9
SkipT	76.5	93.6	87.1	92.2	73	-	82.3	29	-
BoW	77.2	91.2	87.9	83	72.2	43.9	78.4	54.6	73.6

Table 6: Results for sentence relation tasks using an alternative composition function ($f_{\mathbb{C}^{\beta},-}$) during evaluation. AVG denotes the average of the three tasks.

m,s	${MRPC}_{-}^{β}$	${PDTB}_{-}^{β}$	${SICK}_{-}^{β}$	AVG	${MRPC}_{-}^{β}$	${PDTB}_{-}^{β}$	${SICK}_{-}^{β}$	AVG
	$𝒯 = 𝐷𝑖𝑠𝑐$				$𝒯 = 𝑁𝐿𝐼$
$⊙, -$	74.8	48.2	83.6	68.9	76.2	47.2	86.9	70.1
$α, -$	74.9	49.3	83.8	69.3	75.9	47.1	86.9	70
$β, -$	75	48.8	83.4	69.1	75.8	47	87	69.9
$⊙, t$	74.9	48.7	83.6	69.1	76.2	47.8	86.8	70.3
$α, t$	75.2	48.6	83.5	69.1	76.2	47.6	87.3	70.4
$β, t$	74.6	48.9	83.9	69.1	76.2	47.8	87	70.3

Equations

$f_{[,]}(h_{1},h_{2})=[h_{1},h_{2}]$

$f_{\odot}(h_{1},h_{2})=h_{1}\odot h_{2}$

$f_{-}(h_{1},h_{2})=|h_{1}-h_{2}|$

$f_{\odot-}(h_{1},h_{2})=[h_{1}\odot h_{2},|h_{2}-h_{1}|]$

$f_{\otimes}(h_{1},h_{2})=h_{1}\otimes h_{2}$ where $(h_{1}\otimes h_{2})_{i,j}=h_{1i}h_{2j}$

$f_{t}(h_{1},h_{2})=|h_{2}-h_{1}+t|$, where $t\in\mathbb{R}^{d}$

$f_{\mathbb{C}}(h_{1},h_{2})=[h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{R}}+h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{I}},h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{I}}-h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{R}}]$

$f_{\mathbb{C}^{\alpha}}(h_{1},h_{2})=[h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{R}},h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{I}},h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{I}}-h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{R}}]$

$f_{\mathbb{C}^{\beta}}(h_{1},h_{2})=[h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{R}},h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{I}},h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{I}},h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{R}}]$

$f_{m,s,1,2}(h_{1},h_{2})=[f_{m}(h_{1},h_{2}),f_{s}(h_{1},h_{2}),h_{1},h_{2}]$

Full text

Composition of Sentence Embeddings: Lessons from Statistical Relational Learning

Damien Sileo

IRIT, University of Toulouse, France

Synapse Développement, Toulouse, France

Tim Van De Cruys

IRIT, CNRS, France

Camille Pradel

Synapse Développement, Toulouse, France

Philippe Muller

IRIT, University of Toulouse, France

Abstract

Various NLP problems – such as the prediction of sentence similarity, entailment, and discourse relations – are all instances of the same general task: the modeling of semantic relations between a pair of textual elements. A popular model for such problems is to embed sentences into fixed size vectors, and use composition functions (e.g. concatenation or sum) of those vectors as features for the prediction. At the same time, composition of embeddings has been a main focus within the field of Statistical Relational Learning (SRL) whose goal is to predict relations between entities (typically from knowledge base triples). In this article, we show that previous work on relation prediction between texts implicitly uses compositions from baseline SRL models. We show that such compositions are not expressive enough for several tasks (e.g. natural language inference). We build on recent SRL models to address textual relational problems, showing that they are more expressive, and can alleviate issues from simpler compositions. The resulting models significantly improve the state of the art in both transferable sentence representation learning and relation prediction.

1 Introduction

Predicting relations between textual units is a widespread task, essential for discourse analysis, dialog systems, information retrieval, or paraphrase detection. Since relation prediction often requires a form of understanding, it can also be used as a proxy to learn transferable sentence representations. Several tasks that are useful to build sentence representations are derived directly from text structure, without human annotation: sentence order prediction (Logeswaran et al., 2016; Jernite et al., 2017), the prediction of previous and subsequent sentences (Kiros et al., 2015; Jernite et al., 2017), or the prediction of explicit discourse markers between sentence pairs (Nie et al., 2017; Jernite et al., 2017). Human labeled relations between sentences can also be used for that purpose, e.g. inferential relations (Conneau et al., 2017). While most work on sentence similarity estimation, entailment detection, answer selection, or discourse relation prediction seemingly uses task-specific models, they all involve predicting whether a relation $R$ holds between two sentences $s_{1}$ and $s_{2}$ . This genericity has been noticed in the literature before (Baudiš et al., 2016) and it has been leveraged for the evaluation of sentence embeddings within the SentEval framework (Conneau et al., 2017).

A straightforward way to predict the probability of $(s_{1},R,s_{2})$ being true is to represent $s_{1}$ and $s_{2}$ with $d$ -dimensional embeddings $h_{1}$ and $h_{2}$ , and to compute sentence pair features $f(h_{1},h_{2})$ , where $f$ is a composition function (e.g. concatenation, product, …). A softmax classifier $g_{\theta}$ can learn to predict $R$ with those features. $g_{\theta}\circ f$ can be seen as a reasoning based on the content of $h_{1}$ and $h_{2}$ (Socher et al., 2013).

Our contributions are as follows:

– we review composition functions used in textual relational learning and show that they lack expressiveness (section 2);

– we draw analogies with existing SRL models (section 3) and design new compositions inspired from SRL (section 4);

– we perform extensive experiments to test composition functions and show that some of them can improve the learning of representations and their downstream uses (section 6).

2 Composition functions for relation prediction

We review here popular composition functions used for relation prediction based on sentence embeddings. Ideally, they should simultaneously fulfill the following minimal requirements:

– make use of interactions between representations of sentences to relate;

– allow for the learning of asymmetric relations (e.g. entailment, order);

– be usable with high dimensionalities (parameters $\theta$ and $f$ should fit in GPU memory).

Additionally, if the main goal is transferable sentence representation learning, compositions should also incentivize gradually changing sentences to lie on a linear manifold, since transfer usually uses linear models. Another goal can be learning of transferable relation representation. Concretely, a sentence encoder and $f$ can be trained on a base task, and $f(h_{1},h_{2})$ can be used as features for transfer in another task. In that case, the geometry of the sentence embedding space is less relevant, as long as the $f(h_{1},h_{2})$ space works well for transfer learning. Our evaluation will cover both cases.

A straightforward instantiation of $f$ is concatenation (Hooda & Kosseim, 2017):

$f_{[,]}(h_{1},h_{2})=[h_{1},h_{2}]$ (1)

However, interactions between $s_{1}$ and $s_{2}$ cannot be modeled with $f_{[,]}$ followed by a softmax regression. Indeed, $f_{[,]}(h_{1},h_{2})\theta$ can be rewritten as a sum of independent contributions from $h_{1}$ and $h_{2}$ , namely $\theta_{[0:d]}h_{1}+\theta_{[d:2d]}h_{2}$ . Using a multi-layer perceptron before the softmax would solve this issue, but it also harms sentence representation learning (Conneau et al., 2017; Logeswaran & Lee, 2018), possibly because the perceptron allows for accurate predictions even if the sentence embeddings lie in a convoluted space. To promote interactions between $h_{1}$ and $h_{2}$ , element-wise product has been used in Baudiš et al. (2016):

$f_{\odot}(h_{1},h_{2})=h_{1}\odot h_{2}$ (2)

Absolute difference is another solution for sentence similarity (Mueller & Thyagarajan, 2016), and its element-wise variation may equally be used to compute informative features:

$f_{-}(h_{1},h_{2})=|h_{1}-h_{2}|$ (3)

The latter two were combined into a popular instantiation, sometimes referred to as heuristic matching (Tai et al., 2015; Kiros et al., 2015; Mou et al., 2015):

$f_{\odot-}(h_{1},h_{2})=[h_{1}\odot h_{2},|h_{2}-h_{1}|]$ (4)

Although effective for certain similarity tasks, $f_{\odot-}$ is symmetrical, and should be a poor choice for tasks like entailment prediction or prediction of discourse relations. For instance, if $R_{e}$ denotes entailment and $(s_{1},s_{2})$ = (“It just rained”, “The ground is wet”), $(s_{1},R_{e},s_{2})$ should hold but not $(s_{2},R_{e},s_{1})$ . The $f_{\odot-}$ composition function is nonetheless used to train/evaluate models on entailment (Conneau et al., 2017) or discourse relation prediction (Nie et al., 2017).
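The two limitations discussed so far can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the function names, the dimension `d = 8`, and the random vectors are illustrative assumptions. It shows that a linear score over the concatenation splits into independent contributions from $h_{1}$ and $h_{2}$, and that $f_{\odot-}$ returns the same features when the two sentences are swapped.

```python
import numpy as np

def f_concat(h1, h2):
    """Equation 1: concatenation [h1, h2]."""
    return np.concatenate([h1, h2])

def f_prod(h1, h2):
    """Equation 2: element-wise product."""
    return h1 * h2

def f_absdiff(h1, h2):
    """Equation 3: element-wise absolute difference."""
    return np.abs(h1 - h2)

def f_heuristic(h1, h2):
    """Equation 4: heuristic matching [h1 * h2, |h2 - h1|]."""
    return np.concatenate([f_prod(h1, h2), f_absdiff(h2, h1)])

# Illustrative random embeddings and weights (assumed values, not trained ones).
rng = np.random.default_rng(0)
d = 8
h1, h2 = rng.normal(size=d), rng.normal(size=d)
theta = rng.normal(size=2 * d)

# A linear score over the concatenation is a sum of two independent terms,
# so it cannot express any interaction between h1 and h2.
logit = f_concat(h1, h2) @ theta
split = theta[:d] @ h1 + theta[d:] @ h2
concat_is_additive = bool(np.isclose(logit, split))

# Heuristic matching yields identical features for (h1, h2) and (h2, h1).
heuristic_is_symmetric = bool(
    np.allclose(f_heuristic(h1, h2), f_heuristic(h2, h1))
)
```

Because the features are identical in both directions, any classifier on top of `f_heuristic` alone must assign the same score to $(s_{1},R,s_{2})$ and $(s_{2},R,s_{1})$.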

Sometimes $[h_{1},h_{2}]$ is concatenated to $f_{\odot-}(h_{1},h_{2})$ (Ampomah et al., 2016; Conneau et al., 2017). While the resulting composition is asymmetrical, the asymmetrical component involves no interaction, as noted previously. We note that this composition is very commonly used. On the SNLI benchmark (nlp.stanford.edu/projects/snli/, as of February 2019), $12$ out of the $25$ listed sentence embedding based models use it, and $7$ use a weaker form (e.g. omitting $f_{\odot}$ ).

The outer product $\otimes$ has been used instead for asymmetric multiplicative interaction (Jernite et al., 2017):

$f_{\otimes}(h_{1},h_{2})=h_{1}\otimes h_{2}$ where $(h_{1}\otimes h_{2})_{i,j}=h_{1i}h_{2j}$ (5)

This formulation is expressive but it forces $g_{\theta}$ to have $d^{2}$ parameters per relation, which is prohibitive when there are many relations and $d$ is high.
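To give a sense of scale, the short sketch below counts the softmax weights needed per relation for each composition. The helper name and its dictionary are hypothetical; the value $d=4096$ is the embedding size used later in the experimental setup.

```python
def softmax_weights_per_relation(d, composition):
    """Number of softmax weights per relation, i.e. the feature size of f."""
    sizes = {
        "concat": 2 * d,      # [h1, h2]
        "product": d,         # h1 * h2
        "abs_diff": d,        # |h1 - h2|
        "heuristic": 2 * d,   # [h1 * h2, |h2 - h1|]
        "outer": d * d,       # outer product h1 (x) h2
    }
    return sizes[composition]

d = 4096
outer_params = softmax_weights_per_relation(d, "outer")
heuristic_params = softmax_weights_per_relation(d, "heuristic")
ratio = outer_params // heuristic_params
```

With these assumptions the outer product needs about 16.8 million weights per relation, which is 2048 times more than heuristic matching.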

The problems outlined above are well known in SRL. Thus, existing compositions (except $f_{\otimes}$ ) can only model relations superficially for tasks currently used to train state of the art sentence encoders, like NLI or discourse connectives prediction.

3 Statistical Relational Learning models

In this section we introduce the context of statistical relational learning (SRL) and relevant models. Recently, SRL has focused on efficient and expressive relation prediction based on embeddings. A core goal of SRL (Getoor & Taskar, 2007) is to induce whether a relation $R$ holds between two arbitrary entities $e_{1},e_{2}$ . As an example, we would like to assign a score to $(e_{1},R,e_{2})$ = (Paris, located_in, France) that reflects a high probability. In embedding-based SRL models, entities $e_{i}$ have vector representations in $\mathbb{R}^{d}$ and a scoring function reflects truth values of relations. The scoring function should allow for relation-dependent reasoning over the latent space of entities. Scoring functions can have relation-specific parameters, which can be interpreted as relation embeddings. Table 1 presents an overview of a number of state of the art relational models. We can distinguish two families of models: subtractive and multiplicative.

The TransE scoring function is motivated by the idea that translations in latent space can model analogical reasoning and hierarchical relationships. Dense word embeddings trained on tasks related to the distributional hypothesis naturally allow for analogical reasoning with translations without explicit supervision (Mikolov et al., 2013). TransE generalizes the older Unstructured model. We call them subtractive models.

The RESCAL, Distmult, and ComplEx scoring functions can be seen as dot product matching between $e_{1}$ and a relation-specific linear transformation of $e_{2}$ (Liu et al., 2017). This transformation helps to check whether $e_{1}$ matches with some aspects of $e_{2}$ . RESCAL allows a full linear mapping $W_{r}e_{2}$ but has a high complexity, while Distmult is restricted to a component-wise weighting $w_{r}\odot e_{2}$ . ComplEx has fewer parameters than RESCAL but still allows for the modeling of asymmetrical relations. As shown in Liu et al. (2017), ComplEx boils down to a restriction of RESCAL where $W_{r}$ is a block diagonal matrix. These blocks are 2-dimensional, antisymmetric and have equal diagonal terms. Using such a form, even and odd indexes of $e$ ’s dimensions play the roles of real and imaginary numbers respectively. The ComplEx model (Trouillon et al., 2016) and its variations (Lacroix et al., 2018) yield state of the art performance on knowledge base completion on numerous evaluations.
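The scoring functions of Table 1 are short enough to write out directly. The sketch below is an illustrative NumPy transcription under assumed names and random inputs, not a reference implementation. It also checks two claims from this section: DistMult is symmetric in its arguments while ComplEx is not, and ComplEx equals RESCAL with a block diagonal $W_{r}$ once real and imaginary parts are interleaved on even and odd indices.

```python
import numpy as np

def unstructured(e1, e2, p=1):
    return np.linalg.norm(e1 - e2, ord=p)

def transe(e1, w_r, e2, p=1):
    return np.linalg.norm(e1 + w_r - e2, ord=p)

def rescal(e1, W_r, e2):
    return e1 @ W_r @ e2

def distmult(e1, w_r, e2):
    return np.sum(e1 * w_r * e2)

def complex_score(e1, w_r, e2):
    # Re <e1, w_r, conj(e2)> with complex-valued inputs
    return np.real(np.sum(e1 * w_r * np.conj(e2)))

def interleave(z):
    """Real vector with real parts on even and imaginary parts on odd indices."""
    out = np.empty(2 * z.shape[0])
    out[0::2], out[1::2] = z.real, z.imag
    return out

def block_diagonal_from_complex(w_r):
    """RESCAL matrix made of 2x2 blocks [[p, q], [-q, p]] for each w = p + qi."""
    k = w_r.shape[0]
    W = np.zeros((2 * k, 2 * k))
    for i, w in enumerate(w_r):
        W[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[w.real, w.imag],
                                               [-w.imag, w.real]]
    return W

# Illustrative random complex embeddings (assumed values).
rng = np.random.default_rng(0)
k = 4
e1 = rng.normal(size=k) + 1j * rng.normal(size=k)
e2 = rng.normal(size=k) + 1j * rng.normal(size=k)
w_r = rng.normal(size=k) + 1j * rng.normal(size=k)

forward = complex_score(e1, w_r, e2)
backward = complex_score(e2, w_r, e1)
as_rescal = rescal(interleave(e1), block_diagonal_from_complex(w_r), interleave(e2))
```

Each 2x2 block has equal diagonal terms and opposite off-diagonal terms, which is the constraint described above; dropping that constraint gives a more general block diagonal RESCAL.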

4 Embeddings composition as SRL models

We claim that several existing models (Conneau et al., 2017; Nie et al., 2017; Baudiš et al., 2016) boil down to SRL models where the sentence embeddings ($h_{1},h_{2}$) act as entity embeddings ($e_{1},e_{2}$). This framework is depicted in figure 1. In this article we focus on sentence embeddings, although our framework can straightforwardly be applied to other levels of language granularity (such as words, clauses, or documents).

Some models (Chen et al., 2017b; Seo et al., 2016; Gong et al., 2018; Radford, 2018; Devlin et al., 2018) do not rely on explicit sentence encodings to perform relation prediction. They combine information of input sentences at earlier stages, using conditional encoding or cross-attention. There is however no straightforward way to derive transferable sentence representations in this setting, and so these models are out of the scope of this paper. They sometimes make use of composition functions, so our work could still be relevant to them in some respect.

In this section we will make a link between sentence composition functions and SRL scoring functions, and propose new scoring functions drawing inspiration from SRL.

4.1 Linking composition functions and SRL models

The composition function $f_{\odot}$ from equation 2 followed by a softmax regression yields a score whose analytical form is identical to the Distmult model score described in section 3. Let $\theta_{R}$ denote the softmax weights for relation $R$ . The logit score for the truth of $(s_{1},R,s_{2})$ is $f(h_{1},h_{2})\theta_{R}=(h_{1}\odot h_{2})\theta_{R}$ which is equal to the Distmult scoring function $<h_{1},\theta_{R},h_{2}>$ if $h_{1},h_{2}$ act as entity embeddings and $\theta_{R}$ as the relation weight $w_{R}$ .

Similarly, the composition $f_{-}$ from equation 3 followed by a softmax regression can be seen as an element-wise weighted score of Unstructured (both are equal if softmax weights are all unitary).

Thus, $f_{\odot-}$ from equation 4 (with softmax regression) can be seen as a weighted ensemble of Unstructured and Distmult. These two models are respectively outperformed by TransE and ComplEx on knowledge base link prediction by a large margin (Trouillon et al., 2016; Bordes et al., 2013a). We therefore propose to change the Unstructured and Distmult in $f_{\odot-}$ such that they match their respective state of the art variations in the following sections. We will also show the implications of these refinements.
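The two identities used in this subsection can be verified on random vectors. This is a small sketch with assumed names and values: it checks that the logit of $f_{\odot}$ under softmax weights $\theta_{R}$ equals the Distmult trilinear product, and that $f_{-}$ with unit weights equals the Unstructured score with $p=1$.

```python
import numpy as np

# Illustrative random embeddings and softmax weights (assumed values).
rng = np.random.default_rng(1)
d = 6
h1, h2 = rng.normal(size=d), rng.normal(size=d)
theta_R = rng.normal(size=d)

# Logit of the product composition versus the Distmult score <h1, theta_R, h2>.
logit_product = (h1 * h2) @ theta_R
distmult_score = np.sum(h1 * theta_R * h2)

# Absolute difference with unit weights versus Unstructured with p = 1.
logit_absdiff = np.abs(h1 - h2) @ np.ones(d)
unstructured_score = np.linalg.norm(h1 - h2, ord=1)
```

With non-unit weights, the second logit becomes a per-dimension weighted variant of Unstructured, which is the "element-wise weighted score" mentioned above.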

4.2 Casting TransE as a composition

Simply replacing $|h_{2}-h_{1}|$ with

$f_{t}(h_{1},h_{2})=|h_{2}-h_{1}+t|$, where $t\in\mathbb{R}^{d}$ (6)

would make the model analogous to TransE. $t$ is learned and is shared by all relations. A relation-specific translation $t_{R}$ could be used but it would make $f$ relation-specific. Instead, here, each dimension of $f_{t}(h_{1},h_{2})$ can be weighted according to a given relation. Non-zero $t$ makes $f_{t}$ asymmetrical and also yields features that allow for the checking of an analogy between $s_{1}$ and $s_{2}$ . Sentence embeddings often rely on pre-trained word embeddings which have demonstrated strong capabilities for analogical reasoning. Some analogies, such as part-whole, are computable with off-the-shelf word embeddings (Chen et al., 2017a) and should be very informative for natural language inference tasks. As an illustration, let us consider an artificial semantic space (depicted in figures 2(a) and 2(b)) where we posit that there is a “to the past” translation $t$ so that $h_{1}+t$ is the embedding of a sentence $s_{1}$ changed to the past tense. Unstructured is not able to leverage this semantic space to correctly score $(s_{1},R_{to\_the\_past},s_{2})$ while TransE is well tailored to provide highest scores for sentences near $h_{1}+\hat{t}$ where $\hat{t}$ is an estimation of $t$ that could be learned from examples.
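The "to the past" illustration can be reproduced in a toy two-dimensional space. The sketch below uses made-up coordinates (`t_past`, `h_present`) purely for illustration. Note the sign convention: with $f_{t}(h_{1},h_{2})=|h_{2}-h_{1}+t|$, the features vanish when $h_{2}=h_{1}-t$, so the vector $t$ stored in the composition is the negative of the semantic shift it detects.

```python
import numpy as np

def f_t(h1, h2, t):
    """Translation composition |h2 - h1 + t| from the text."""
    return np.abs(h2 - h1 + t)

# Hypothetical 2-D semantic space: moving along the second axis means
# "changed to the past tense". These coordinates are made up.
t_past = np.array([0.0, 1.0])
h_present = np.array([0.3, 0.0])
h_past = h_present + t_past

# Under the sign convention above, the matching parameter is -t_past.
t = -t_past
forward = f_t(h_present, h_past, t)    # present -> past: all zeros
backward = f_t(h_past, h_present, t)   # past -> present: non-zero

# With t = 0 the composition reduces to the symmetric absolute difference.
zero = np.zeros(2)
symmetric_case = np.allclose(f_t(h_present, h_past, zero),
                             f_t(h_past, h_present, zero))
```

A downstream linear classifier can then use the near-zero features in the forward direction as evidence for the relation, something the symmetric absolute difference cannot provide.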

4.3 Casting ComplEx as a composition

Let us partition $h$ dimensions into two equally sized sets $\mathcal{R}$ and $\mathcal{I}$ , e.g. even and odd dimension indices of $h$ . We propose a new function $f_{\mathbb{C}}$ as a way to fit the ComplEx scoring function into a composition function.

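$f_{\mathbb{C}}(h_{1},h_{2})=[h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{R}}+h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{I}},h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{I}}-h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{R}}]$ (7)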

$f_{\mathbb{C}}(h_{1},h_{2})$ multiplied by softmax weights $\theta_{r}$ is equivalent to the ComplEx scoring function $\text{Re}<h_{1},\theta_{r},\overline{h_{2}}>$ . The first half of $\theta_{r}$ weights corresponds to the real part of ComplEx relation weights while the last half corresponds to the imaginary part.

$f_{\mathbb{C}}$ is to the ComplEx scoring function what $f_{\odot}$ is to the DistMult scoring function. Intuitively, ComplEx is a minimal way to model interactions between distinct latent dimensions while Distmult only allows for identical dimensions to interact.
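The stated equivalence between $f_{\mathbb{C}}$ with softmax weights and the ComplEx score can be checked numerically. The following sketch assumes the even/odd partition suggested above for $\mathcal{R}$ and $\mathcal{I}$; function names and random inputs are illustrative.

```python
import numpy as np

def split_real_imag(h):
    """Even indices play the role of real parts, odd indices of imaginary parts."""
    return h[0::2], h[1::2]

def f_complex(h1, h2):
    """f_C: [R1*R2 + I1*I2, R1*I2 - I1*R2]."""
    r1, i1 = split_real_imag(h1)
    r2, i2 = split_real_imag(h2)
    return np.concatenate([r1 * r2 + i1 * i2, r1 * i2 - i1 * r2])

# Illustrative random embeddings and softmax weights (assumed values).
rng = np.random.default_rng(2)
d = 8
h1, h2 = rng.normal(size=d), rng.normal(size=d)
theta_r = rng.normal(size=d)  # f_C outputs d features: d/2 + d/2

# Logit from the composition followed by a linear layer.
logit = f_complex(h1, h2) @ theta_r

# Same quantity computed as Re <z1, w, conj(z2)> with complex numbers.
r1, i1 = split_real_imag(h1)
r2, i2 = split_real_imag(h2)
z1, z2 = r1 + 1j * i1, r2 + 1j * i2
w = theta_r[: d // 2] + 1j * theta_r[d // 2:]  # first half real, last half imaginary
complex_score = np.real(np.sum(z1 * w * np.conj(z2)))
```

The second half of `f_complex` changes sign when the sentences are swapped, which is where the asymmetry of ComplEx comes from.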

Let us consider a new artificial semantic space (shown in figures 2(c) and 2(d)) where the first dimension is high when a sentence means that it just rained, and the second dimension is high when the ground is wet. Over this semantic space, Distmult is only able to detect entailment for paraphrases whereas ComplEx is also able to naturally model that (“it just rained”, $R_{entailment}$ , “the ground is wet”) should be high while its converse should not.

We also propose two more general versions of $f_{\mathbb{C}}$ :

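$f_{\mathbb{C}^{\alpha}}(h_{1},h_{2})=[h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{R}},h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{I}},h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{I}}-h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{R}}]$ (8)

$f_{\mathbb{C}^{\beta}}(h_{1},h_{2})=[h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{R}},h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{I}},h_{1}^{\mathcal{R}}\odot h_{2}^{\mathcal{I}},h_{1}^{\mathcal{I}}\odot h_{2}^{\mathcal{R}}]$ (9)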

$f_{\mathbb{C}^{\alpha}}$ can be seen as Distmult concatenated with the asymmetrical part of ComplEx and $f_{\mathbb{C}^{\beta}}$ can be seen as RESCAL with unconstrained block diagonal relation matrices.
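One way to see why these variants are "more general" is that each one can be mapped linearly onto the previous one, so a linear classifier on top loses nothing. The sketch below, with assumed function names and the even/odd partition, recovers $f_{\mathbb{C}^{\alpha}}$ from $f_{\mathbb{C}^{\beta}}$ and $f_{\mathbb{C}}$ from $f_{\mathbb{C}^{\alpha}}$.

```python
import numpy as np

def _parts(h1, h2):
    # Even indices as "real" parts, odd indices as "imaginary" parts.
    return h1[0::2], h1[1::2], h2[0::2], h2[1::2]

def f_complex(h1, h2):
    r1, i1, r2, i2 = _parts(h1, h2)
    return np.concatenate([r1 * r2 + i1 * i2, r1 * i2 - i1 * r2])

def f_complex_alpha(h1, h2):
    r1, i1, r2, i2 = _parts(h1, h2)
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2 - i1 * r2])

def f_complex_beta(h1, h2):
    r1, i1, r2, i2 = _parts(h1, h2)
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2, i1 * r2])

# Illustrative random embeddings (assumed values).
rng = np.random.default_rng(3)
d = 8
k = d // 2
h1, h2 = rng.normal(size=d), rng.normal(size=d)

a = f_complex_alpha(h1, h2)   # 3 * d/2 features
b = f_complex_beta(h1, h2)    # 2 * d features

# beta -> alpha: keep the first two blocks, subtract the last two.
alpha_from_beta = np.concatenate([b[:k], b[k:2 * k], b[2 * k:3 * k] - b[3 * k:]])
# alpha -> f_C: add the first two blocks, keep the antisymmetric block.
c_from_alpha = np.concatenate([a[:k] + a[k:2 * k], a[2 * k:]])
```

The cost of the extra generality is a larger feature vector: `d` features for $f_{\mathbb{C}}$, `3d/2` for $f_{\mathbb{C}^{\alpha}}$ and `2d` for $f_{\mathbb{C}^{\beta}}$.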

5 On the evaluation of relational models

The SentEval framework (Conneau et al., 2017) provides a general evaluation for transferable sentence representations, with open source evaluation code. One only needs to specify a sentence encoder function, and the framework performs classification tasks or relation prediction tasks using cross-validated logistic regression on embeddings or composed sentence embeddings. Tasks include sentiment analysis, entailment, textual similarity, textual relatedness, and paraphrase detection. These tasks are a rich way to train or evaluate sentence representations since in a triple $(s_{1},R,s_{2})$ , we can see $(R,s_{2})$ as a label for $s_{1}$ (Baudiš et al., 2016). Unfortunately, the relational tasks hard-code the composition function from equation 4. From our previous analysis, we believe this composition function favors the use of contextual/lexical similarity rather than high-level reasoning and can penalize representations based on more semantic aspects. This bias could harm research since semantic representation is an important next step for sentence embedding. Training/evaluation datasets are also arguably flawed with respect to relational aspects since several recent studies (Dasgupta et al., 2018; Poliak et al., 2018; Gururangan et al., 2018; Glockner et al., 2018) show that InferSent, despite being state of the art on SentEval evaluation tasks, has poor performance when dealing with asymmetrical tasks and non-additive composition of words. In addition to providing new ways of training sentence encoders, we will also extend the SentEval evaluation framework with a more expressive composition function when dealing with relational transfer tasks, which improves results even when the sentence encoder was not trained with it.

6 Experiments

Our goal is to show that transferable sentence representation learning and relation prediction tasks can be improved when our expressive compositions are used instead of the composition from equation 4. We train our relational model adaptations on two relation prediction base tasks ( $\mathcal{T}$ ), one supervised ( $\mathcal{T}=\mathit{NLI}$ ) and one unsupervised ( $\mathcal{T}=\mathit{Disc}$ ) described below, and evaluate sentence/relation representations on base and transfer tasks using the SentEval framework in order to quantify the generalization capabilities of our models. Since we use minor modifications of InferSent and SentEval, our experiments are easily reproducible.

6.1 Training tasks

The goal of natural language inference ($\mathcal{T}=\mathit{NLI}$) is to predict whether the relation between two sentences (premise and hypothesis) is Entailment, Contradiction or Neutral. We use the combination of the SNLI dataset (Bowman et al., 2015) and the MNLI dataset (Williams et al., 2017). We call the resulting dataset of $1M$ examples AllNLI. Conneau et al. (2017) claim that NLI data allows universal sentence representation learning. They used the $f_{\odot,-}$ composition function with concatenated sentence representations in order to train their Infersent model.

We also train on the prediction of discourse connectives between sentences/clauses ( $\mathcal{T}$ = Disc). Discourse connectives make discourse relations between sentences explicit. In the sentence I live in Paris but I’m often elsewhere, the word but highlights that there is a contrast between the two clauses it connects. We use Malmi et al.’s (2017) dataset of selected $400k$ instances with $20$ discourse connectives (e.g. however, for example) with the provided train/dev/test split. This dataset has no other supervision than the list of 20 connectives. Nie et al. (2017) used $f_{\odot,-}$ concatenated with the sum of sentence representations to train their model, DisSent, on a similar task and showed that their encoder was general enough to perform well on SentEval tasks. They use a dataset that is, at the time of writing, not publicly available.

6.2 Evaluation tasks

Table 2 provides an overview of different transfer tasks that will be used for evaluation. We added another relation prediction task, the PDTB coarse-grained implicit discourse relation task, to SentEval. This task involves predicting a discursive link between two sentences among $\{$ Comparison, Contingency, Entity based coherence, Expansion, Temporal $\}$ . We followed the setup of Pitler et al. (2009), without sampling negative examples in training. MRPC, PDTB and SICK will be tested with two composition functions: besides SentEval composition $f_{\odot,-}$ , we will use $f_{\mathbb{C}^{\beta},-}$ for transfer learning evaluation, since it has the most general multiplicative interaction and it does not penalize models that do not learn a translation. For all tasks except STS14, a cross-validated logistic regression is used on the sentence or relation representation. The evaluation of the STS14 task relies on Pearson or Spearman correlation between cosine similarity and the target. We force the composition function to be symmetrical on the MRPC task since paraphrase detection should be invariant to permutation of input sentences.
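The symmetrization used for MRPC (see Table 2) amounts to averaging the composition over both argument orders. Below is a minimal sketch with assumed helper names, using $f_{\mathbb{C}^{\beta},-}$ as the underlying composition and the even/odd partition for real and imaginary parts.

```python
import numpy as np

def f_complex_beta(h1, h2):
    r1, i1, r2, i2 = h1[0::2], h1[1::2], h2[0::2], h2[1::2]
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2, i1 * r2])

def f_beta_minus(h1, h2):
    """f_{C^beta,-}: multiplicative part followed by the absolute difference."""
    return np.concatenate([f_complex_beta(h1, h2), np.abs(h1 - h2)])

def symmetrized(f, h1, h2):
    """Average of f over both argument orders, as in the MRPC row of Table 2."""
    return (f(h1, h2) + f(h2, h1)) / 2.0

# Illustrative random embeddings (assumed values).
rng = np.random.default_rng(4)
d = 8
h1, h2 = rng.normal(size=d), rng.normal(size=d)

raw_is_symmetric = bool(np.allclose(f_beta_minus(h1, h2), f_beta_minus(h2, h1)))
sym_is_symmetric = bool(np.allclose(symmetrized(f_beta_minus, h1, h2),
                                    symmetrized(f_beta_minus, h2, h1)))
```

The raw composition is order-sensitive because the last two multiplicative blocks swap places when the inputs are exchanged; averaging removes that sensitivity without changing the feature size.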

6.3 Setup

We want to compare the different instances of $f$ . We follow the setup of Infersent (Conneau et al., 2017): we learn to encode sentences into $h$ with a bi-directional LSTM using element-wise max pooling over time. The dimension size of $h$ is $4096$ . Word embeddings are fixed GloVe with 300 dimensions, trained on Common Crawl 840B (https://nlp.stanford.edu/projects/glove/). Optimization is done with SGD and a decreasing learning rate until convergence.

The only difference with regard to Infersent is the composition. Sentences are composed with six different compositions for training according to the following template:

$f_{m,s,1,2}(h_{1},h_{2})=[f_{m}(h_{1},h_{2}),f_{s}(h_{1},h_{2}),h_{1},h_{2}]$ (10)

$f_{s}$ (subtractive interaction) is in $\{f_{-},f_{t}\}$ , $f_{m}$ (multiplicative interaction) is in $\{{f_{\odot}},f_{\mathbb{C}^{\alpha}},f_{\mathbb{C}^{\beta}}\}$ . We do not consider $f_{\mathbb{C}}$ since it yielded inferior results in our early experiments using NLI and SentEval development sets.

$f_{m,s,1,2}(h_{1},h_{2})$ is fed directly to a softmax regression. Note that Infersent uses a multi-layer perceptron before the softmax, but uses only linear activations, so $f_{\odot,-,1,2}(h_{1},h_{2})$ is analytically equivalent to Infersent when $\mathcal{T}=\mathit{NLI}$ .
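The six training compositions can be assembled from the template in a few lines. The sketch below is an assumption-laden illustration rather than the training code: the names are hypothetical, the translation `t` is a zero placeholder for a parameter that would be learned, and the embeddings are random. It mainly shows the size of the feature vector that reaches the softmax for $d=4096$.

```python
import numpy as np

def _parts(h1, h2):
    # Even indices as "real" parts, odd indices as "imaginary" parts.
    return h1[0::2], h1[1::2], h2[0::2], h2[1::2]

def f_prod(h1, h2):
    return h1 * h2

def f_alpha(h1, h2):
    r1, i1, r2, i2 = _parts(h1, h2)
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2 - i1 * r2])

def f_beta(h1, h2):
    r1, i1, r2, i2 = _parts(h1, h2)
    return np.concatenate([r1 * r2, i1 * i2, r1 * i2, i1 * r2])

def f_minus(h1, h2):
    return np.abs(h1 - h2)

def make_f_t(t):
    """Translation composition with a fixed t (learned during training)."""
    return lambda h1, h2: np.abs(h2 - h1 + t)

def compose(f_m, f_s, h1, h2):
    """Template [f_m(h1, h2), f_s(h1, h2), h1, h2]."""
    return np.concatenate([f_m(h1, h2), f_s(h1, h2), h1, h2])

d = 4096
rng = np.random.default_rng(5)
h1, h2 = rng.normal(size=d), rng.normal(size=d)
t = np.zeros(d)  # placeholder value; the real t is a trained parameter

feature_sizes = {}
for m_name, f_m in [("prod", f_prod), ("alpha", f_alpha), ("beta", f_beta)]:
    for s_name, f_s in [("minus", f_minus), ("t", make_f_t(t))]:
        feature_sizes[(m_name, s_name)] = compose(f_m, f_s, h1, h2).shape[0]
```

Under these assumptions the feature vector has `4d` entries for $m=\odot$, `4.5d` for $m=\mathbb{C}^{\alpha}$ and `5d` for $m=\mathbb{C}^{\beta}$, so the richer compositions add at most 25% more softmax weights per relation.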

6.4 Results

We ran several experiments with different initializations and found that the standard deviations between runs are not negligible. We therefore take them into account when reporting scores, contrary to previous work (Kiros et al., 2015; Conneau et al., 2017): we average the scores of 6 distinct runs for each task and use standard deviations under a normality assumption to compute significance. Table 3 shows model scores for $\mathcal{T}=\mathit{NLI}$ , while Table 4 shows scores for $\mathcal{T}=\mathit{Disc}$ . For comparison, Table 5 shows a number of important models from previous work. Finally, in Table 6, we present results for sentence relation tasks that use an alternative composition function ( $f_{\mathbb{C}^{\beta},-}$ ) instead of the standard composition function used in SentEval.
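The exact significance test is not spelled out here, so the following is only one plausible reading: a two-sided test on the difference of mean scores under a normal approximation, combined with a Bonferroni-corrected threshold. The function names are hypothetical and the run scores are synthetic placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev

def two_sided_p_value(scores_a, scores_b):
    """Normal-approximation test on the difference of mean scores (assumed test)."""
    se = math.sqrt(stdev(scores_a) ** 2 / len(scores_a)
                   + stdev(scores_b) ** 2 / len(scores_b))
    z = (mean(scores_a) - mean(scores_b)) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni_significant(p_value, alpha=0.05, n_comparisons=5):
    """Compare the p-value with the Bonferroni-corrected level alpha / n."""
    return p_value < alpha / n_comparisons

# Synthetic scores for 6 runs each; made up purely to exercise the functions.
runs_baseline = [70.0, 70.2, 70.1, 69.9, 70.3, 70.1]
runs_proposed = [70.6, 70.8, 70.5, 70.7, 70.9, 70.6]

p_value = two_sided_p_value(runs_proposed, runs_baseline)
is_significant = bonferroni_significant(p_value)
```

With 5 comparisons, a nominal level of 0.05 corresponds to a per-comparison threshold of 0.01, which is stricter than an uncorrected test.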

For sentence representation learning, the baseline $f_{\odot,-}$ composition already performs rather well, being on par with the InferSent scores of the original paper, as would be expected. However, when macro-averaging all accuracies, it is the second worst performing model. $f_{\mathbb{C}^{\alpha},t,1,2}$ is the best performing model, and all three best models use the translation ( $s=t$ ). On relational transfer tasks, training with $f_{\mathbb{C}^{\alpha},t,1,2}$ and using complex $\mathbb{C}^{\beta}$ for transfer (Table 6) always outperforms the baseline ( $f_{\odot,-,1,2}$ with $\odot-$ composition in Tables 3 and 4). When averaging the accuracies of those transfer tasks, this result is significant for both training tasks at level $p<0.05$ (using Bonferroni correction accounting for the 5 comparisons). On base tasks and the average of non-relational transfer tasks (MR, MPQA, SUBJ, TREC), our proposed compositions are on average slightly better than $f_{\odot,-,1,2}$ . Representations learned with our proposed compositions can still be compared with simple cosine similarity: all three methods using the translational composition ( $s=t$ ) very significantly outperform the baseline (significant at level $p<0.01$ with Bonferroni correction) on STS14 for $\mathcal{T}=\mathit{NLI}$ . Thus, we believe $f_{\mathbb{C}^{\alpha},t,1,2}$ has more robust results and could be a better default choice than $f_{\odot,-,1,2}$ as composition for representation learning. Note that our compositions are also beneficial with regard to convergence speed: on average, each of our proposed compositions needed fewer epochs to converge than the baseline $f_{\odot,-,1,2}$ , for both training tasks.

Additionally, using $\mathbb{C}^{\beta}$ (Table 6) instead of $\odot$ (Tables 3 and 4) for transfer learning in relational transfer tasks (PDTB, MRPC, SICK) yields a significant improvement on average, even when $m=\odot$ was used for training ( $p<0.001$ ). Therefore, we believe $f_{\mathbb{C}^{\beta},-}$ is an interesting composition for inference or evaluation of models regardless of how they were trained.

7 Related work

There are numerous interactions between SRL and NLP. We believe that our framework merges two specific lines of work: relation prediction and modeling textual relational tasks.

Some previous NLP work focused on composition functions for relation prediction between text fragments, even though it ignored SRL and only dealt with word units. Word2vec (Mikolov et al., 2013) sparked great interest in this task with word analogies in the latent space. Levy & Goldberg (2014) explored different scoring functions between words, notably for analogies. Hypernymy relations were also studied by Chang et al. (2017) and Fu et al. (2014). Levy et al. (2015) proposed tailored scoring functions. Even the skipgram model (Mikolov et al., 2013) can be formulated as finding relations between context and target words. We did not empirically explore textual relational learning at the word level, but we believe that it would fit in our framework, and could be tested in future studies. Numerous approaches (Chen et al., 2017b; Seok et al., 2016; Gong et al., 2018; Joshi et al., 2018) were proposed to predict inference relations between sentences, but they do not explicitly use sentence embeddings. Instead, they encode sentences jointly, possibly with the help of previously cited word compositions. It would therefore also be interesting to try applying our techniques within their framework.

Some modeling aspects of textual relational learning have been formally investigated by Baudiš et al. (2016). They noticed the genericity of relational problems and explored multi-task and transfer learning on relational tasks. Their work is complementary to ours since their framework unifies tasks while ours unifies composition functions. Subsequent approaches use relational tasks for training and evaluation on specific datasets (Conneau et al., 2017; Nie et al., 2017).

8 Conclusion

We have demonstrated that a number of existing models used for textual relational learning rely on composition functions that are already used in Statistical Relational Learning. By taking into account previous insights from SRL, we proposed new composition functions and evaluated them. These composition functions are all simple to implement and we hope that it will become standard to try them on relational problems. Larger scale data might leverage these more expressive compositions, as well as more compositional, asymmetric, and arguably more realistic datasets (Dasgupta et al., 2018; Gururangan et al., 2018). Finally, our compositions can also help improve the interpretability of embeddings, since they can help measure relation prediction asymmetry. Analogies through translations helped interpret word embeddings, and perhaps analyzing our learned translation $t$ could help interpret sentence embeddings.

Bibliography

1. Ampomah et al. (2016) Isaac K E Ampomah, Seong-bae Park, and Sang-jo Lee. A Sentence-to-Sentence Relation Network for Recognizing Textual Entailment. World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering, 10(12):1955–1958, 2016.
2. Baudiš et al. (2016) Petr Baudiš, Jan Pichl, Tomáš Vyskočil, and Jan Šedivý. Sentence Pair Scoring: Towards Unified Framework for Text Comprehension. 2016. URL http://arxiv.org/abs/1603.06127.
3. Bordes et al. (2013a) Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. A Semantic Matching Energy Function for Learning with Multi-relational Data. Machine Learning, 2013a. ISSN 0885-6125. doi: 10.1007/s10994-013-5363-6. URL http://arxiv.org/abs/1301.3485.
4. Bordes et al. (2013b) Antoine Bordes, Nicolas Usunier, Jason Weston, and Oksana Yakhnenko. Translating Embeddings for Modeling Multi-Relational Data. Advances in NIPS, 26:2787–2795, 2013b. ISSN 10495258. doi: 10.1007/s13398-014-0173-7.2.
5. Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17-21 September 2015, (September):632–642, 2015. ISSN 9781941643327.
6. Chang et al. (2017) Haw-Shiuan Chang, Zi Yun Wang, Luke Vilnis, and Andrew McCallum. Unsupervised Hypernym Detection by Distributional Inclusion Vector Embedding. 2017. URL http://arxiv.org/abs/1710.00880.
7. Chen et al. (2017a) Dawn Chen, Joshua C. Peterson, and Thomas L. Griffiths. Evaluating vector-space models of analogy. CoRR, abs/1705.04416, 2017a.
8. Chen et al. (2017b) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Regina Barzilay and Min-Yen Kan (eds.), ACL (1), pp. 1657–1668. Association for Computational Linguistics, 2017b. ISBN 978-1-945626-75-3. URL http://dblp.uni-trier.de/db/conf/acl/acl2017-1.html#ChenZLWJI17.